Comparative Analysis of Machine Learning Algorithms with RFE-CV for Student Dropout Prediction
DOI: https://doi.org/10.52436/1.jutif.2025.6.3.4695
Keywords: Educational Data Mining, Machine Learning, RFE-CV, Student Dropout
Abstract
High student dropout rates are a persistent problem for higher education institutions, affecting quality assessments and accreditation evaluations by BAN-PT (Indonesia's National Accreditation Board for Higher Education). This study develops an early-warning model for identifying students at risk of dropping out, using demographic data and a learning analytics approach. Five classification algorithms are compared: Random Forest (RF), Decision Tree (DT), Logistic Regression (LR), Light Gradient Boosting Machine (LGBM), and Support Vector Machine (SVM). The dataset consists of 2013-cohort undergraduate student records from Sebelas Maret University (n = 2,476), processed through standard preprocessing, resampled with SMOTE, and validated with K-fold cross-validation. RF delivered the best performance with 96.01% accuracy, followed by LGBM (95.26%), DT (91.24%), LR (83.68%), and SVM (83.19%). Recursive Feature Elimination with Cross-Validation (RFE-CV) improved model efficiency by reducing the number of features without significantly degrading performance; the best balance between feature count and accuracy was reached when 75% of the features were retained. The most influential features include IPS_range (the semester GPA range), parental income, and students' region of origin, along with several other demographic factors. This study contributes to the development of early warning systems in higher education by providing an accurate predictive model and identifying key risk factors.
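To make the pipeline concrete, the sketch below shows one way to combine SMOTE resampling, RFE-CV feature selection, and stratified K-fold validation around a Random Forest classifier using scikit-learn and imbalanced-learn. This is an illustrative reconstruction under stated assumptions, not the authors' code: the synthetic data, estimator settings, and fold count are placeholders for the paper's preprocessed demographic features and actual configuration.

```python
# Minimal sketch (assumed setup, not the authors' code): SMOTE resampling,
# RFE-CV feature selection, and stratified K-fold validation with Random Forest.
# Requires scikit-learn and imbalanced-learn.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # sampler-aware pipeline: SMOTE runs per fold

# Placeholder for the preprocessed demographic features and dropout labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 20))          # e.g., encoded demographic columns
y = rng.integers(0, 2, size=500)        # 1 = dropout, 0 = retained (hypothetical)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

pipeline = Pipeline(steps=[
    ("smote", SMOTE(random_state=42)),  # oversample the minority (dropout) class
    ("rfecv", RFECV(                    # recursively drop the weakest features,
        estimator=RandomForestClassifier(n_estimators=100, random_state=42),
        step=1,                         # one feature removed per iteration
        cv=cv,
        scoring="accuracy",
    )),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# Cross-validated accuracy of the full resample -> select -> classify pipeline.
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
print(f"Mean CV accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")

# Fit once on all data to inspect which features RFE-CV kept.
pipeline.fit(X, y)
kept = np.flatnonzero(pipeline.named_steps["rfecv"].support_)
print("Indices of selected features:", kept)
```

Keeping SMOTE inside the imbalanced-learn Pipeline matters: oversampling is then applied only to each training fold, so the reported cross-validation accuracy is not inflated by synthetic samples leaking into validation folds.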
License
Copyright (c) 2025 Sekar Gesti Amalia Utami, Haryono Setiadi, Arif Rohmadi

This work is licensed under a Creative Commons Attribution 4.0 International License.