Comparative Analysis of GPT-2 Augmentation, ALBERT, and Similarity Measures for Cyberbullying Detection

Authors

  • Zidane Hidayat, Faculty of Information Technology and Data Science, Universitas Sebelas Maret, Indonesia
  • Hasan Dwi Cahyono, Faculty of Information Technology and Data Science, Universitas Sebelas Maret, Indonesia
  • Fajar Muslim, Faculty of Information Technology and Data Science, Universitas Sebelas Maret, Indonesia

DOI:

https://doi.org/10.52436/1.jutif.2026.7.2.5320

Keywords:

ALBERT, Cyberbullying, Data Augmentation, GPT-2, NLP, Similarity Measure

Abstract

The effectiveness of cyberbullying detection depends on the availability of sufficient, diverse, and contextually rich training data, which is often limited in low-resource languages such as Indonesian. To address this limitation, researchers have extensively explored data augmentation (DA) as a promising approach to improving model performance. DA generates new data instances by applying transformations to existing data, thereby increasing both dataset size and variability. Prior studies have demonstrated that applying Easy Data Augmentation (EDA) with Support Vector Machine (SVM) classification improved cyberbullying detection performance, despite challenges in capturing semantic and contextual nuances. In this paper, we investigated DA for Indonesian text using the Transformer-based GPT-2 model. The augmented sentences were evaluated and filtered for context, semantics, diversity, and novelty, using similarity measures such as Euclidean Distance (ED), Cosine Similarity (CS), Jaccard Similarity (JS), and BLEU Score (BLS) to ensure augmentation quality. Furthermore, we compared text classification performance using both SVM and the Transformer-based ALBERT model. Experimental results revealed that combining GPT-2 augmentation with similarity-based filtering failed to improve cyberbullying detection performance, potentially because of the semantic drift introduced by GPT-2 and the inadequacy of the similarity measures in capturing nuanced contextual information. However, ALBERT outperformed SVM as a classification model, achieving average F1-scores of 91.77% and 91.72%, respectively. This study contributes to the informatics field by exploring the potential of Transformer-based augmentation and similarity-based evaluation for low-resource text classification, while acknowledging the limitations in data quality and model adaptation.
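The filtering step described in the abstract — keeping a GPT-2 paraphrase only if it stays semantically close to the source sentence without being a near-duplicate — can be sketched with token-level versions of the similarity measures named above. This is an illustrative sketch, not the authors' implementation; the threshold values and the `keep_augmented` helper are hypothetical, and the paper's measures operate on the actual model representations rather than raw token counts.

```python
import math
from collections import Counter

def cosine_similarity(a_tokens, b_tokens):
    """Cosine Similarity (CS) over bag-of-words term-frequency vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard_similarity(a_tokens, b_tokens):
    """Jaccard Similarity (JS) over token sets: |A & B| / |A | B|."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

def euclidean_distance(a_tokens, b_tokens):
    """Euclidean Distance (ED) between bag-of-words count vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    vocab = set(a) | set(b)
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in vocab))

def keep_augmented(original, augmented, cs_min=0.5, js_max=0.9):
    """Keep an augmented sentence only if it is on-topic (CS above a floor)
    but not a near-duplicate (JS below a cap). Thresholds are illustrative,
    not the values used in the paper."""
    o, a = original.lower().split(), augmented.lower().split()
    return cosine_similarity(o, a) >= cs_min and jaccard_similarity(o, a) <= js_max
```

In this toy version an exact copy is rejected (its Jaccard score of 1.0 exceeds the duplicate cap), while a lightly reworded sentence that shares enough vocabulary passes, which mirrors the diversity-versus-fidelity trade-off the similarity measures are meant to police.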


References

Y. E. Riany and F. Utami, “Cyberbullying Perpetration among Adolescents in Indonesia: The Role of Fathering and Peer Attachment,” Int Journal of Bullying Prevention, May 2023, doi: 10.1007/s42380-023-00165-x.

U. Kamath, J. Liu, and J. Whitaker, Deep Learning for NLP and Speech Recognition. Cham: Springer International Publishing, 2019. doi: 10.1007/978-3-030-14596-5.

C. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge: Cambridge University Press, 2008.

J. Wei and K. Zou, “EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks,” Aug. 25, 2019, arXiv: arXiv:1901.11196. doi: 10.48550/arXiv.1901.11196.

J.-P. Corbeil and H. A. Ghadivel, “BET: A Backtranslation Approach for Easy Data Augmentation in Transformer-based Paraphrase Identification Context,” Sept. 25, 2020, arXiv: arXiv:2009.12452. doi: 10.48550/arXiv.2009.12452.

X. Dai and H. Adel, “An Analysis of Simple Data Augmentation for Named Entity Recognition,” in Proceedings of the 28th International Conference on Computational Linguistics, D. Scott, N. Bel, and C. Zong, Eds., Barcelona, Spain (Online): International Committee on Computational Linguistics, Dec. 2020, pp. 3861–3867. doi: 10.18653/v1/2020.coling-main.343.

G. Daval-Frerot and Y. Weis, “WMD at SemEval-2020 Tasks 7 and 11: Assessing Humor and Propaganda Using Unsupervised Data Augmentation,” in Proceedings of the Fourteenth Workshop on Semantic Evaluation, Barcelona (online): International Committee for Computational Linguistics, 2020, pp. 1865–1874. doi: 10.18653/v1/2020.semeval-1.246.

A. Fabbri et al., “Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou, Eds., Online: Association for Computational Linguistics, June 2021, pp. 704–717. doi: 10.18653/v1/2021.naacl-main.57.

Y. Hou, S. Chen, W. Che, C. Chen, and T. Liu, “C2C-GenDA: Cluster-to-Cluster Generation for Data Augmentation of Slot Filling,” AAAI, vol. 35, no. 14, pp. 13027–13035, May 2021, doi: 10.1609/aaai.v35i14.17540.

C. Rastogi, N. Mofid, and F.-I. Hsiao, “Can We Achieve More with Less? Exploring Data Augmentation for Toxic Comment Classification,” July 02, 2020, arXiv: arXiv:2007.00875. doi: 10.48550/arXiv.2007.00875.

G. Yan, Y. Li, S. Zhang, and Z. Chen, “Data Augmentation for Deep Learning of Judgment Documents,” in Intelligence Science and Big Data Engineering. Big Data and Machine Learning, Z. Cui, J. Pan, S. Zhang, L. Xiao, and J. Yang, Eds., in Lecture Notes in Computer Science, vol. 11936, Cham: Springer International Publishing, 2019, pp. 232–242. doi: 10.1007/978-3-030-36204-1_19.

Q. Xie, Z. Dai, E. Hovy, M.-T. Luong, and Q. V. Le, “Unsupervised data augmentation for consistency training,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, in NIPS ’20. Red Hook, NY, USA: Curran Associates Inc., 2020.

C. Shorten, T. M. Khoshgoftaar, and B. Furht, “Text Data Augmentation for Deep Learning,” Journal of Big Data, vol. 8, no. 1, p. 101, July 2021, doi: 10.1186/s40537-021-00492-0.

A. Wirawan, H. D. Cahyono, and Winarno, “Easy Data Augmentation in Sentiment Analysis of Cyberbullying,” in 2023 6th International Conference on Information and Communications Technology (ICOIACT), 2023, pp. 443–447. doi: 10.1109/ICOIACT59844.2023.10455817.

S. Y. Feng et al., “A Survey of Data Augmentation Approaches for NLP,” in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, C. Zong, F. Xia, W. Li, and R. Navigli, Eds., Online: Association for Computational Linguistics, Aug. 2021, pp. 968–988. doi: 10.18653/v1/2021.findings-acl.84.

A. Vaswani et al., “Attention is all you need,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, in NIPS’17. Red Hook, NY, USA: Curran Associates Inc., 2017, pp. 6000–6010.

T. Kober, J. Weeds, L. Bertolini, and D. Weir, “Data Augmentation for Hypernymy Detection,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, P. Merlo, J. Tiedemann, and R. Tsarfaty, Eds., Online: Association for Computational Linguistics, Apr. 2021, pp. 1034–1048. doi: 10.18653/v1/2021.eacl-main.89.

K. Li, C. Chen, X. Quan, Q. Ling, and Y. Song, “Conditional Augmentation for Aspect Term Extraction via Masked Sequence-to-Sequence Generation,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, Eds., Online: Association for Computational Linguistics, July 2020, pp. 7056–7066. doi: 10.18653/v1/2020.acl-main.631.

A. Celikyilmaz, E. Clark, and J. Gao, “Evaluation of Text Generation: A Survey,” May 18, 2021, arXiv: arXiv:2006.14799. doi: 10.48550/arXiv.2006.14799.

R. Mussabayev, “Optimizing Euclidean Distance Computation,” Mathematics, vol. 12, no. 23, p. 3787, Nov. 2024, doi: 10.3390/math12233787.

J. Zobel and A. Moffat, “Exploring the similarity space,” SIGIR Forum, vol. 32, no. 1, pp. 18–34, Apr. 1998, doi: 10.1145/281250.281256.

G. Travieso, A. Benatti, and L. da F. Costa, “An Analytical Approach to the Jaccard Similarity Index,” Oct. 21, 2024, arXiv: arXiv:2410.16436. doi: 10.48550/arXiv.2410.16436.

R. Bawden, B. Zhang, L. Yankovskaya, A. Tättar, and M. Post, “A Study in Improving BLEU Reference Coverage with Diverse Automatic Paraphrasing,” in Findings of the Association for Computational Linguistics: EMNLP 2020, T. Cohn, Y. He, and Y. Liu, Eds., Online: Association for Computational Linguistics, Nov. 2020, pp. 918–932. doi: 10.18653/v1/2020.findings-emnlp.82.

J. Li, X. Zhang, and X. Zhou, “ALBERT-Based Self-Ensemble Model With Semisupervised Learning and Data Augmentation for Clinical Semantic Textual Similarity Calculation: Algorithm Validation Study,” JMIR Med Inform, vol. 9, no. 1, p. e23086, Jan. 2021, doi: 10.2196/23086.

“Demographic Statistics Indonesia (Results of Population Census 2020).” BPS-Statistics Indonesia, Jan. 31, 2025. Accessed: May 20, 2025. [Online]. Available: https://www.bps.go.id/en/publication/2025/01/31/29a40174e02f20a7a31b5bc3/demographic-statistics-indonesia--results-of-population-census-2020-.html

L. Sundary and F. Fauzah, “Studi Analisis Perkembangan Bahasa Indonesia di Era Digital,” Innovative, vol. 4, no. 3, pp. 11295–11303, June 2024, doi: 10.31004/innovative.v4i3.11633.

S. F. N. Azizah, H. D. Cahyono, S. W. Sihwi, and W. Widiarto, “Performance Analysis of Transformer Based Models (BERT, ALBERT, and RoBERTa) in Fake News Detection,” in 2023 6th International Conference on Information and Communications Technology (ICOIACT), 2023, pp. 425–430. doi: 10.1109/ICOIACT59844.2023.10455849.

D. Refai, S. Abu-Soud, and M. J. Abdel-Rahman, “Data Augmentation Using Transformers and Similarity Measures for Improving Arabic Text Classification,” IEEE Access, vol. 11, pp. 132516–132531, 2023, doi: 10.1109/ACCESS.2023.3336311.

V. Maslej-Krešňáková, M. Sarnovský, and J. Jacková, “Use of Data Augmentation Techniques in Detection of Antisocial Behavior Using Deep Learning Methods,” Future Internet, vol. 14, no. 9, p. 260, Aug. 2022, doi: 10.3390/fi14090260.

N. A. Ranggianto, D. Purwitasari, C. Fatichah, and R. W. Sholikah, “Abstractive and Extractive Approaches for Summarizing Multi-document Travel Reviews,” J. RESTI (Rekayasa Sist. Teknol. Inf.), vol. 7, no. 6, pp. 1464–1475, Dec. 2023, doi: 10.29207/resti.v7i6.5170.

F. Sufi, “Generative Pre-Trained Transformer (GPT) in Research: A Systematic Review on Data Augmentation,” Information, vol. 15, no. 2, p. 99, Feb. 2024, doi: 10.3390/info15020099.

A. W. Qurashi, V. Holmes, and A. P. Johnson, “Document Processing: Methods for Semantic Text Similarity Analysis,” in 2020 International Conference on INnovations in Intelligent SysTems and Applications (INISTA), Novi Sad, Serbia: IEEE, Aug. 2020, pp. 1–6. doi: 10.1109/INISTA49547.2020.9194665.

D. A. Pisner and D. M. Schnyer, “Support vector machine,” in Machine Learning, Elsevier, 2020, pp. 101–121. doi: 10.1016/B978-0-12-815739-8.00006-7.

Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “ALBERT: A Lite BERT for Self-supervised Learning of Language Representations,” Feb. 09, 2020, arXiv: arXiv:1909.11942. doi: 10.48550/arXiv.1909.11942.

Published

2026-04-15

How to Cite

[1]
Z. Hidayat, H. D. Cahyono, and F. Muslim, “Comparative Analysis of GPT-2 Augmentation, ALBERT, and Similarity Measures for Cyberbullying Detection”, J. Tek. Inform. (JUTIF), vol. 7, no. 2, pp. 875–890, Apr. 2026.