Optimizing Type 2 Diabetes Classification with Feature Selection and Class Balancing in Machine Learning
DOI:
https://doi.org/10.52436/1.jutif.2025.6.4.5166
Keywords:
Diabetes, Feature selection, Imbalance class, Machine Learning
Abstract
Accurate detection of Type 2 Diabetes Mellitus (T2DM) is a crucial factor in patient survival and treatment effectiveness. Errors in diabetes detection lead to increased disease severity, high costs, prolonged healing time, and a decline in service quality. A major challenge in developing Machine Learning (ML)-based decision support systems for detection is the class imbalance in medical data, together with high feature dimensionality, both of which can affect the accuracy and efficiency of the model. This research proposes an approach based on feature selection (FS) and class imbalance handling to improve classification performance for T2DM. Several feature selection techniques, such as Information Gain (IG), Gain Ratio (GR), Gini Decrease (GD), Chi-Square (CS), Relief-F, and Fast Correlation-Based Filter (FCBF), select features by ranking their weights. To address the imbalanced class distribution, we use the Synthetic Minority Over-Sampling Technique (SMOTE). ML classification models such as Support Vector Machine (SVM), Gradient Boosting (GB), Tree, Neural Network (NN), Random Forest (RF), and AdaBoost were tested and evaluated on confusion-matrix metrics (accuracy, precision, and recall) as well as time. The experimental results show that combining strategies for handling imbalanced classes significantly improves the predictive performance of the ML algorithms. In addition, the combination of IG feature selection with the AdaBoost classifier (IG+AdaBoost) consistently gave the best performance. This study emphasizes the importance of data preprocessing and of selecting the right algorithms when developing ML-based T2DM detection systems. Accurate detection can reduce the severity of disease, lower treatment costs, speed up the healing process, and improve healthcare services.
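The two preprocessing ideas named in the abstract, ranking features by Information Gain and oversampling the minority class with SMOTE, can be illustrated with a minimal pure-Python sketch. This is not the study's implementation: the function names, the toy feature names (`high_glucose`, `smoker`), and all data values are hypothetical, and `smote` is a simplified version that only interpolates between a minority sample and one of its nearest minority neighbours.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(column, labels):
    """Entropy reduction obtained by splitting labels on a discrete feature."""
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(rows, labels, names):
    """Return (name, score) pairs sorted by information gain, highest first."""
    scores = [(name, information_gain([r[i] for r in rows], labels))
              for i, name in enumerate(names)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    randomly chosen minority point and one of its k nearest minority
    neighbours (simplified SMOTE; assumes numeric features)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy, made-up data: two hypothetical discretised features and a binary label.
names = ["high_glucose", "smoker"]
rows = [[1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1]]
labels = [1, 1, 1, 0, 0, 0]
ranking = rank_features(rows, labels, names)
print(ranking)

# Toy minority-class points (numeric features) to oversample.
minority = [[1.0, 1.0], [2.0, 2.0], [3.0, 1.0]]
new_points = smote(minority, n_new=4)
print(new_points)
```

In a real workflow the oversampling step would normally be applied to the training split only, so that synthetic samples do not leak into the data used for evaluation.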
License
Copyright (c) 2025 Agus Wantoro, Aviv Fitria Yuliana, Dwi Yana Ayu Andini, Ikna Awaliyani, Wahyu Caesarendra

This work is licensed under a Creative Commons Attribution 4.0 International License.