Enhancing Classification of Self-Reported Monkeypox Symptoms on Social Media Using Term Frequency-Inverse Document Frequency Features and Graph Attention Networks

Authors

  • Rizailo Akfa Rizian, Department of Computer Science, Lambung Mangkurat University, Indonesia
  • Irwan Budiman, Department of Computer Science, Lambung Mangkurat University, Indonesia
  • Mohammad Reza Faisal, Department of Computer Science, Lambung Mangkurat University, Indonesia
  • Dwi Kartini, Department of Computer Science, Lambung Mangkurat University, Indonesia
  • Fatma Indriani, Department of Computer Science, Lambung Mangkurat University, Indonesia
  • Umar Ali Ahmad, Collaborative Researcher, Kanazawa University, Kanazawa, Ishikawa, Japan

DOI:

https://doi.org/10.52436/1.jutif.2025.6.6.5482

Keywords:

Graph Attention Network, Monkeypox, Social Media, Text Classification, TF-IDF

Abstract

Early detection of infectious diseases is crucial for limiting their spread and enabling timely intervention. In the digital era, social media has emerged as a valuable source of real-time health information, where individuals often share self-reported symptoms that can serve as early warning signals for disease outbreaks. However, social media text is typically unstructured, noisy, and contextually diverse, which poses challenges for conventional text classification methods. This study proposes a hybrid model that combines Term Frequency–Inverse Document Frequency (TF-IDF) feature representation with a Graph Attention Network (GAT) to improve the early detection of self-reported Monkeypox symptoms on Indonesian social media. A dataset of 3,200 tweets was collected with Tweet-Harvest, then preprocessed and manually labeled, yielding a nearly balanced distribution of positive (51%) and negative (49%) samples. TF-IDF vectors were used to construct a document similarity graph via the k-Nearest Neighbors (k-NN) method with cosine similarity, enabling the GAT to leverage both textual and relational information across posts. Model performance was evaluated using accuracy, precision, recall, and macro-F1, with macro-F1 serving as the primary indicator. The proposed TF-IDF + GAT model achieved 93.07% accuracy and a macro-F1 score of 93.06%, outperforming baseline classifiers such as CNN (92.16% macro-F1), SVM (85.73%), and Logistic Regression (84.89%). These findings demonstrate the effectiveness of integrating classical text representations with graph-based neural architectures for improving social-media-based disease surveillance and supporting early epidemic response strategies.
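
The graph construction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF variant, the value of k, and the toy posts are all assumptions made for illustration, and a real pipeline would more likely rely on a library vectorizer. The sketch builds TF-IDF vectors, links each post to its k most cosine-similar posts, and returns the edge list that a graph attention layer would take as input alongside the TF-IDF node features.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalised TF-IDF dict (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed tokenizer
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (one common variant): ln((1 + N) / (1 + df)) + 1
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors (a dot product)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def knn_edges(vectors, k):
    """Directed edges (i, j) where j is one of the k nearest neighbours of i."""
    edges = []
    for i, u in enumerate(vectors):
        sims = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        # Highest similarity first; ties broken by the lower document index.
        sims.sort(key=lambda s: (-s[0], s[1]))
        edges.extend((i, j) for _, j in sims[:k])
    return edges


# Hypothetical toy posts (English stand-ins, not taken from the study's data).
posts = [
    "fever and rash on my arms",
    "rash and fever since monday",
    "monkeypox news from the ministry",
    "ministry shares monkeypox news update",
]
vectors = tfidf_vectors(posts)
edges = knn_edges(vectors, k=1)  # k=1 is illustrative only
print(edges)
```

With these toy posts, the two symptom reports link to each other and the two news posts link to each other, which is the kind of relational signal the attention mechanism can then weight when aggregating neighbor features.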

Published

2026-01-05

How to Cite

[1]
R. A. Rizian, I. Budiman, M. R. Faisal, D. Kartini, F. Indriani, and U. A. Ahmad, “Enhancing Classification of Self-Reported Monkeypox Symptoms on Social Media Using Term Frequency-Inverse Document Frequency Features and Graph Attention Networks,” J. Tek. Inform. (JUTIF), vol. 6, no. 6, pp. 5865–5881, Jan. 2026.