Comparative Analysis of Data Balancing Techniques for Machine Learning Classification on Imbalanced Student Perception Datasets

Authors

  • Ahmad Saekhu, Magister of Computer Science, Universitas Amikom Purwokerto, Jawa Tengah, Indonesia
  • Berlilana, Magister of Computer Science, Universitas Amikom Purwokerto, Jawa Tengah, Indonesia
  • Dhanar Intan Surya Saputra, Magister of Computer Science, Universitas Amikom Purwokerto, Jawa Tengah, Indonesia

DOI:

https://doi.org/10.52436/1.jutif.2025.6.2.4286

Keywords:

Class imbalance, Classification performance, Ensemble models, Machine learning, SMOTE

Abstract

Class imbalance is a common challenge in machine learning classification tasks, often leading to predictions biased toward the majority class. This study evaluates how well various machine learning algorithms, combined with advanced data balancing techniques, address class imbalance in a dataset collected from Class XI students of SMK Ma'arif 1 Kebumen. The dataset comprises 300 instances and 36 features, including textual attributes, demographic information, and sentiment labels categorized as Positive, Neutral, and Negative. Preprocessing included text cleaning, target encoding, handling of missing data, and vectorization. Four sampling techniques (SMOTE, SMOTE + Tomek Links, ADASYN, and SMOTE + ENN) were applied to the training data to create balanced datasets. Nine machine learning algorithms, among them CatBoost, Extra Trees, Random Forest, and Gradient Boosting, were evaluated using four train-test splits (60:40, 70:30, 80:20, and 90:10). Model performance was assessed using accuracy, precision, recall, F1-score, and AUC-ROC. The results show that SMOTE + Tomek Links is the most effective balancing technique, achieving the highest accuracy when paired with ensemble algorithms such as Extra Trees and Random Forest. CatBoost also delivered competitive performance, showing its adaptability in imbalanced scenarios. The 90:10 train-test split consistently yielded the best results, underlining the importance of adequate training data for model generalization. This study highlights the critical role of data balancing techniques and robust algorithms in optimizing classification performance on imbalanced datasets and provides a framework for future research in similar contexts.
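
The abstract notes that resampling was applied to the training data only, after the train-test split. The following is a minimal, illustrative sketch of that ordering in plain Python. It is not the authors' implementation: the toy data, the class sizes, the helper names `stratified_split` and `smote`, and the choice of k = 5 neighbours are all assumptions made for the example, and the `smote` function is a simplified version of SMOTE-style interpolation rather than a library routine.

```python
import random
from collections import Counter


def stratified_split(labels, test_ratio, rng):
    """Return (train_idx, test_idx), splitting each class separately."""
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = max(1, round(len(idxs) * test_ratio))
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx


def smote(X, y, k, rng):
    """Simplified SMOTE: grow each smaller class to the majority size by
    interpolating between a sample and one of its k nearest same-class
    neighbours."""
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pts = [x for x, lab in zip(X, y) if lab == label]
        if len(pts) < 2:
            continue  # cannot interpolate with a single point
        for _ in range(target - n):
            base = rng.choice(pts)
            neighbours = sorted(
                (p for p in pts if p is not base),
                key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
            )[:k]
            mate = rng.choice(neighbours)
            gap = rng.random()  # position along the segment base -> mate
            X_out.append([a + gap * (b - a) for a, b in zip(base, mate)])
            y_out.append(label)
    return X_out, y_out


# Synthetic, made-up data: three sentiment classes with unequal sizes.
rng = random.Random(0)
class_sizes = {"Positive": 60, "Neutral": 25, "Negative": 15}
centres = {"Positive": 0.0, "Neutral": 2.0, "Negative": 4.0}
X, y = [], []
for label, size in class_sizes.items():
    for _ in range(size):
        X.append([rng.gauss(centres[label], 1.0), rng.gauss(centres[label], 1.0)])
        y.append(label)

# 1) Split first (90:10 here), 2) then balance the training part only.
train_idx, test_idx = stratified_split(y, test_ratio=0.10, rng=rng)
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]
X_test = [X[i] for i in test_idx]
y_test = [y[i] for i in test_idx]

X_bal, y_bal = smote(X_train, y_train, k=5, rng=rng)

print("train before:", Counter(y_train))
print("train after: ", Counter(y_bal))
print("test (untouched):", Counter(y_test))
```

Keeping the test split untouched means the reported metrics reflect the original class distribution rather than synthetic samples. In practice, a library such as imbalanced-learn offers `SMOTE`, `ADASYN`, `SMOTETomek`, and `SMOTEENN` classes that could replace the hand-written `smote` helper; whether the study used that library is not stated in the abstract.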

Published

2025-04-26

How to Cite

[1]
A. Saekhu, B. Berlilana, and D. I. S. Saputra, “Comparative Analysis of Data Balancing Techniques for Machine Learning Classification on Imbalanced Student Perception Datasets”, J. Tek. Inform. (JUTIF), vol. 6, no. 2, pp. 627–640, Apr. 2025.