Optimizing Bag of Words and Word2Vec with Vocabulary Pruning and TF-IDF Weighted Embeddings for Accurate Chatbot Responses in Indonesian Treasury Services

Authors

  • Eko Aprianto, Faculty of Information Technology, Universitas Budi Luhur, Jakarta, Indonesia
  • Deni Mahdiana, Faculty of Information Technology, Universitas Budi Luhur, Jakarta, Indonesia
  • Arief Wibowo, Faculty of Information Technology, Universitas Budi Luhur, Jakarta, Indonesia

DOI:

https://doi.org/10.52436/1.jutif.2026.7.1.5370

Keywords:

Bag of Words, Chatbot Development, TF-IDF Weighting, Vocabulary Pruning, Word2Vec

Abstract

The high volume of support tickets submitted to the HAI DJPb Service Desk has caused delays and inconsistent response quality for payroll-related inquiries from Indonesian treasury work units (Satker). To improve the accuracy and efficiency of public service responses, this research proposes an optimized text-vectorization framework for chatbot development that combines Bag of Words (BoW), Word2Vec, vocabulary pruning, and TF-IDF weighted embeddings. The dataset consists of 2024 ticket logs, curated FAQs, and questionnaire data related to the Satker Web Payroll Application. The method comprises preprocessing (snippet removal, normalization, tokenization, stopword removal, stemming), vocabulary pruning based on empirical frequency thresholds (<5 and >80) while preserving domain-specific technical terms, and semantic weighting through TF-IDF. Four vectorization models (BoW, BoW with pruning, Word2Vec, and Word2Vec + TF-IDF) were evaluated using cosine similarity, response time, and accuracy. Results show that BoW achieved the highest accuracy, 88.32%, while Word2Vec produced the most stable response time, averaging 47.32 ms, with a cosine similarity of 0.99. The findings demonstrate that frequency-based representations remain highly effective for structured administrative datasets, while weighted embeddings improve semantic relevance. This study contributes to the field of Informatics by providing an efficient hybrid vectorization framework tailored to Indonesian administrative language, enabling more accurate and scalable chatbot solutions for e-government services.
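
The pipeline summarized in the abstract (frequency-based pruning with protected domain terms, TF-IDF weighted averaging of word vectors, and cosine-similarity matching) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the thresholds are document counts, since the abstract does not state their unit. It uses a common smoothed IDF formula and takes word vectors as a plain dictionary standing in for a trained Word2Vec model. All function names are hypothetical.

```python
import math
from collections import Counter


def prune_vocabulary(docs, min_count=5, max_count=80, keep=frozenset()):
    """Return the tokens kept after frequency-based pruning.

    Assumption: thresholds are document counts (the abstract gives no unit).
    Tokens in `keep` (protected domain terms) survive regardless of frequency.
    """
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {
        tok for tok, count in doc_freq.items()
        if tok in keep or min_count <= count <= max_count
    }


def idf_weights(docs, vocab):
    """Smoothed inverse document frequency for every token in `vocab`."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc) if tok in vocab)
    return {
        tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0
        for tok in vocab
    }


def weighted_embedding(tokens, vectors, idf):
    """Average word vectors, weighting each token by term frequency * IDF.

    `vectors` maps token -> list of floats (a stand-in for Word2Vec output).
    Returns None when no token has both a vector and an IDF weight.
    """
    term_freq = Counter(t for t in tokens if t in vectors and t in idf)
    if not term_freq:
        return None
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in term_freq.items():
        weight = count * idf[tok]
        weight_sum += weight
        for i, value in enumerate(vectors[tok]):
            total[i] += weight * value
    return [value / weight_sum for value in total]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy, already-preprocessed tickets; tiny thresholds are used only for the demo.
tickets = [
    ["gaji", "tidak", "muncul"],
    ["gaji", "terlambat"],
    ["gaji", "muncul", "ganda"],
]
toy_vectors = {"gaji": [1.0, 0.0], "muncul": [0.0, 1.0]}  # hypothetical embeddings

vocab = prune_vocabulary(tickets, min_count=2, max_count=2, keep={"gaji"})
idf = idf_weights(tickets, vocab)
query_vec = weighted_embedding(["gaji", "muncul"], toy_vectors, idf)
faq_vec = weighted_embedding(["gaji", "muncul", "ganda"], toy_vectors, idf)
print(sorted(vocab), cosine_similarity(query_vec, faq_vec))
```

In the toy data, "gaji" occurs in every ticket and would be pruned by the upper threshold, but it is retained because it is listed as a protected domain term. A chatbot built along these lines would return the stored answer whose embedding has the highest cosine similarity to the query.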

Published

2026-02-15

How to Cite

[1] E. Aprianto, D. Mahdiana, and A. Wibowo, “Optimizing Bag of Words and Word2Vec with Vocabulary Pruning and TF-IDF Weighted Embeddings for Accurate Chatbot Responses in Indonesian Treasury Services”, J. Tek. Inform. (JUTIF), vol. 7, no. 1, pp. 587–605, Feb. 2026.
