Constructing a Part-of-Speech Tagging based on Lexicon and Rule-based for Sundanese Corpus

Ade Sutedi; Ayu Latifah; Novan Rodiansyah; Yayat Sudaryat

doi:10.52436/1.jutif.2026.7.3.5361

Authors

Ade Sutedi, Institut Teknologi Garut, Indonesia
Ayu Latifah, Institut Teknologi Garut, Indonesia
Novan Rodiansyah, Institut Teknologi Garut, Indonesia
Yayat Sudaryat, Universitas Pendidikan Indonesia, Indonesia

Keywords:

Annotation, Corpus, Part-Of-Speech, Sundanese

Abstract

Part-of-Speech (POS) tagging is the process of annotating each word in a sentence with its word class (noun, verb, adjective, etc.), and it serves as a foundation for natural language processing and artificial intelligence applications. This study develops a word-class corpus and a set of word-class annotation rules for Sundanese, a low-resource language. Experiments were conducted on an annotated corpus of 104,696 tokens collected from Sundanese dictionaries, Sundanese literature (Carita Pondok, Guguritan, Mantra, Pupujian, Sisindiran, Sajak, and Wawacan), Babasan and Paribasa, and the social media platform X (Twitter). Annotation was carried out in several stages: manual annotation based on cross-lingual transfer from Indonesian POS tags to Sundanese POS tags, followed by adjustment according to Sundanese word-class rules. The study yields a POS-annotated corpus of Sundanese word-tag pairs and a basic rule-based model, which is compared with HMM and CRF models. The rule-based model achieves an F1-score of 0.867 and the CRF model an F1-score of 0.889, while the HMM model attains the highest F1-score, 1.000. Analysis of the POS distributions shows that nouns (KB) dominate consistently across all models, reflecting the noun-rich nature of Sundanese literary texts. The analysis also highlights the challenges of handling unknown words and the need for richer annotated resources, a need that is tied to tag interoperability with Universal POS standards. This research contributes to the development of NLP resources for low-resource languages and provides a methodological foundation for future Sundanese NLP applications.
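To make the lexicon-plus-rules idea concrete, the following is a minimal Python sketch of how such a tagger could work. It is not the authors' implementation. The mini-lexicon, the affix rules, the tag labels other than KB, and the choice to fall back to KB for unknown words are all illustrative assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical mini-lexicon of word-tag pairs (illustration only).
# KB (noun) appears in the abstract; KP (verb) and KS (adjective)
# are assumed labels chosen for this sketch.
LEXICON: Dict[str, str] = {
    "buku": "KB",
    "imah": "KB",
    "indit": "KP",
    "maca": "KP",
    "alus": "KS",
}

# Hypothetical affix rules, tried in order when a word is not in the lexicon.
# Each rule is (kind, affix, tag).
AFFIX_RULES: List[Tuple[str, str, str]] = [
    ("suffix", "keun", "KP"),
    ("prefix", "di", "KP"),
]

# Assumed fallback for unknown words: the most frequent class (noun).
DEFAULT_TAG = "KB"


def tag_token(token: str) -> str:
    """Tag one token: lexicon lookup, then affix rules, then the default."""
    word = token.lower()
    if word in LEXICON:
        return LEXICON[word]
    for kind, affix, tag in AFFIX_RULES:
        # Require a stem of at least three characters besides the affix,
        # so very short words do not trigger a rule by accident.
        if len(word) <= len(affix) + 2:
            continue
        if kind == "prefix" and word.startswith(affix):
            return tag
        if kind == "suffix" and word.endswith(affix):
            return tag
    return DEFAULT_TAG


def tag_sentence(tokens: List[str]) -> List[Tuple[str, str]]:
    """Return (token, tag) pairs for a pre-tokenized sentence."""
    return [(token, tag_token(token)) for token in tokens]


if __name__ == "__main__":
    print(tag_sentence(["Buku", "dibaca"]))
```

A tagger of this shape depends heavily on lexicon coverage: any word that is absent from the lexicon and matches no rule receives the default tag, which is one way the unknown-word problem mentioned in the abstract shows up in practice.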
