Complex Word Identification in Indonesian Children’s Texts: An IndoBERT Baseline and Error Analysis

Authors

  • Lisnawita, Faculty of Computer Science, Universitas Lancang Kuning, Indonesia
  • Juhaida Abu Bakar, Data Science Research Lab, School of Computing, Universiti Utara Malaysia, Malaysia
  • Ruziana Mohamad Rasli, School of Multimedia Technology & Communication, Universiti Utara Malaysia, Malaysia
  • Loneli Costaner, Faculty of Computer Science, Universitas Lancang Kuning, Indonesia
  • Guntoro, Faculty of Computer Science, Universitas Lancang Kuning, Indonesia

DOI:

https://doi.org/10.52436/1.jutif.2025.6.6.5501

Keywords:

complex word identification, error analysis, IndoBERT, Indonesian children’s texts, text simplification, token classification

Abstract

Complex Word Identification (CWI) is a crucial step in building text simplification systems, especially for Indonesian children’s reading materials, where unfamiliar vocabulary can hinder comprehension. This study formulates token-level CWI for Indonesian children’s texts and establishes two baselines: an interpretable rule-based model using linguistic features (e.g., length, syllable heuristics, and affix patterns), and an IndoBERT model fine-tuned for token classification. The study constructs and annotates a children’s text corpus and evaluates both approaches using standard classification metrics. On the test set (22,584 tokens), IndoBERT achieves an F1-score of 0.9972 for the CWI class, substantially outperforming the rule-based baseline (F1 = 0.8607). The IndoBERT system makes only 39 errors (23 false positives and 16 false negatives), indicating near-perfect performance under the evaluated setting. The study further provides an error analysis that highlights remaining failure patterns and borderline cases that are difficult even for contextual models. The resulting benchmark and findings contribute to Informatics/Computer Science by providing a strong baseline and analysis for educational NLP in a low-resource language setting, supporting the development of Indonesian child-oriented NLP resources and downstream text simplification tools.
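To make the rule-based baseline concrete, the sketch below shows the kind of interpretable, feature-driven classifier the abstract describes. The specific length and syllable thresholds and the Indonesian affix lists here are illustrative assumptions for exposition, not the paper's exact rules or tuned values.

```python
import re

# Illustrative Indonesian derivational affixes (assumption, not the paper's exact list).
PREFIXES = ("meng", "peng", "meny", "peny", "mem", "pem", "men", "pen",
            "ber", "ter", "per", "di", "ke", "se")
SUFFIXES = ("kan", "nya", "an", "i")

def count_syllables(word: str) -> int:
    # Approximate Indonesian syllable count as the number of vowel groups.
    return max(1, len(re.findall(r"[aiueo]+", word.lower())))

def is_complex(word: str, max_len: int = 8, max_syll: int = 3) -> bool:
    """Flag a token as complex if it is long, has many syllables,
    or carries both a derivational prefix and a suffix."""
    w = word.lower()
    affixed = w.startswith(PREFIXES) and w.endswith(SUFFIXES)
    return len(w) > max_len or count_syllables(w) > max_syll or affixed
```

For example, a short root word such as "makan" would not be flagged, while a long derived form such as "mempertimbangkan" would be. A fine-tuned IndoBERT token classifier replaces these hand-set thresholds with contextual representations, which is what drives the F1 gap reported above.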



Published

2026-01-05

How to Cite

L. Lisnawita, J. A. Bakar, R. M. Rasli, L. Costaner, and G. Guntoro, “Complex Word Identification in Indonesian Children’s Texts: An IndoBERT Baseline and Error Analysis”, J. Tek. Inform. (JUTIF), vol. 6, no. 6, pp. 5976–5987, Jan. 2026.