Improving Diabetes Prediction Performance Using Random Forest Classifier with Hyperparameter Tuning

Novita Lestari Anggreini; Ade Yuliana; Dadan Saepul Ramdan; Wissam Al-Dayyeni

doi:10.52436/1.jutif.2025.6.4.4755

Authors

Novita Lestari Anggreini Informatics Engineering, Politeknik TEDC, Indonesia
Ade Yuliana Informatics Engineering, Politeknik TEDC, Indonesia
Dadan Saepul Ramdan Informatics Engineering, Politeknik TEDC, Indonesia
Wissam Al-Dayyeni Electrical Engineering, ADA University, Azerbaijan

Keywords:

Data Mining, Diabetes Prediction, Hyperparameter Tuning, Model Enhancement, Random Forest Classifier

Abstract

Diabetes mellitus is a chronic metabolic disorder that poses a serious challenge to global healthcare systems because of its rising prevalence and high treatment costs. Although machine learning is widely used to support early diagnosis, many predictive models still underperform because of limited preprocessing and poorly chosen hyperparameters. This study proposes a comprehensive machine learning pipeline that improves diabetes prediction by using a Random Forest classifier optimized through systematic hyperparameter tuning. The novelty of the method lies in its integrated approach: thorough preprocessing that removes duplicate records, handles inconsistent unique values, addresses missing data, and applies the SMOTE technique to overcome class imbalance. Hyperparameters are tuned using GridSearchCV with 5-fold cross-validation, and only the most influential features are retained to improve model interpretability and efficiency. The proposed model achieved an accuracy of 95%, a recall of 0.88, and an F1-score of 0.85, indicating that it identifies diabetic cases more effectively than the models reported in previous studies that used standard machine learning algorithms. The model contributes to the development of a reliable and scalable early detection system for diabetes that is applicable in clinical decision support environments. Further refinement could come from testing on larger and more diverse datasets or from more efficient tuning techniques such as Bayesian optimization.
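The abstract names SMOTE as the step that addresses class imbalance. As a rough illustration of the idea only, the sketch below creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. It is a simplified, standard-library approximation and not the authors' code or the imbalanced-learn implementation; the function name `smote_like_oversample`, the parameter defaults, and the toy (glucose, BMI) values are all hypothetical and chosen purely for demonstration.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Simplified illustration of the SMOTE idea (an assumption-laden sketch):
    pick a minority sample, pick one of its k nearest minority neighbours,
    and place a new point somewhere on the segment between the two.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: (glucose, BMI) pairs invented for illustration only.
majority = [(90.0, 22.0), (95.0, 24.5), (100.0, 23.0), (105.0, 26.0),
            (88.0, 21.0), (110.0, 27.5), (98.0, 25.0), (102.0, 24.0)]
minority = [(160.0, 33.0), (175.0, 35.5), (150.0, 31.0)]

# Generate just enough synthetic points to match the majority class size.
new_points = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))  # prints "8 8"
```

In a cross-validated setup such as the 5-fold scheme mentioned in the abstract, oversampling of this kind is normally applied to the training folds only, so that synthetic points do not leak into the data used for evaluation.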
