A Comprehensive Benchmarking Pipeline for Transformer-Based Sentiment Analysis using Cross-Validated Metrics

Authors

  • Dodo Zaenal Abidin, Magister of Information System, Faculty of Computer Science, Universitas Dinamika Bangsa, Jambi, Indonesia
  • Lasmedi Afuan, Informatics Engineering, Faculty of Computer Science, Universitas Jenderal Soedirman, Purwokerto, Jawa Tengah, Indonesia
  • Afrizal Nehemia Toscany, Faculty of Computing, Universiti Teknologi Malaysia, Johor Bahru, Malaysia
  • Nurhadi, Magister of Information System, Faculty of Computer Science, Universitas Dinamika Bangsa, Jambi, Indonesia

DOI:

https://doi.org/10.52436/1.jutif.2025.6.4.4894

Keywords:

Benchmarking, Cross-validation, Evaluation metrics, IMDb, Sentiment analysis, Transformers

Abstract

Transformer-based models have significantly advanced sentiment analysis in natural language processing. However, many existing studies still lack robust, cross-validated evaluations and comprehensive performance reporting. This study proposes an integrated benchmarking pipeline for sentiment classification on the IMDb dataset using BERT, RoBERTa, and DistilBERT. The methodology includes systematic preprocessing, stratified 5-fold cross-validation, and aggregate evaluation through confusion matrices, ROC and precision-recall (PR) curves, and multi-metric classification reports. Experimental results demonstrate that all models achieve high accuracy, precision, recall, and F1-score, with RoBERTa leading overall (94.1% mean accuracy and F1), followed by BERT (92.8%) and DistilBERT (92.1%). All models exceed 0.97 in ROC-AUC and PR-AUC, confirming strong discriminative capability. Compared with prior approaches, the pipeline improves the robustness, interpretability, and reproducibility of the reported results. The provided results and open-source code offer a reliable reference for future research and practical deployment. The study is limited to the English-language IMDb dataset; future work should extend the pipeline to multilingual and cross-domain data and integrate explainable AI.
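
The evaluation protocol named in the abstract (stratified 5-fold cross-validation with per-fold metrics aggregated into a mean and spread) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the random seed, the toy data, and the keyword_baseline classifier are hypothetical stand-ins, and a real run would replace keyword_baseline with a fine-tuned BERT, RoBERTa, or DistilBERT model.

```python
import random
from statistics import mean, stdev


def stratified_folds(labels, k=5, seed=42):
    """Split example indices into k folds while preserving class ratios."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets an equal share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def binary_metrics(y_true, y_pred):
    """Confusion matrix plus accuracy, precision, recall, and F1 (positive = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "confusion": [[tn, fp], [fn, tp]],  # rows: true class, columns: predicted
    }


def cross_validate(texts, labels, fit_predict, k=5, seed=42):
    """Run k-fold evaluation and report (mean, standard deviation) per metric."""
    folds = stratified_folds(labels, k, seed)
    per_fold = []
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        preds = fit_predict(
            [texts[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [texts[i] for i in test_idx],
        )
        per_fold.append(binary_metrics([labels[i] for i in test_idx], preds))
    summary = {}
    for name in ("accuracy", "precision", "recall", "f1"):
        values = [fold[name] for fold in per_fold]
        summary[name] = (mean(values), stdev(values))
    return per_fold, summary


def keyword_baseline(train_texts, train_labels, test_texts):
    """Hypothetical stand-in for a fine-tuned transformer; ignores the training data."""
    return [1 if "good" in text else 0 for text in test_texts]


# Toy data (an assumption for illustration, not IMDb): 10 positive, 10 negative reviews.
texts = ["good film"] * 10 + ["bad film"] * 10
labels = [1] * 10 + [0] * 10

folds = stratified_folds(labels, k=5)
per_fold, summary = cross_validate(texts, labels, keyword_baseline, k=5)
print(summary)
```

Reporting the standard deviation next to the mean shows how much a score depends on the particular split, which a single train/test evaluation cannot reveal.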

Published

2025-08-18

How to Cite

[1]
D. Z. Abidin, L. Afuan, A. N. Toscany, and N. Nurhadi, “A Comprehensive Benchmarking Pipeline for Transformer-Based Sentiment Analysis using Cross-Validated Metrics,” J. Tek. Inform. (JUTIF), vol. 6, no. 4, pp. 1797–1810, Aug. 2025.
