Evaluating Synthetic Minority Oversampling Technique Strategies for Diabetes Mellitus Classification using K-Nearest Neighbors Algorithm
DOI:
https://doi.org/10.52436/1.jutif.2025.6.5.5189
Keywords:
Cross-Validation, Diabetes Mellitus, K-Nearest Neighbors, Medical Classification, SMOTE
Abstract
Data-driven classification of Diabetes Mellitus underpins medical decision support systems that must be both accurate and efficient. A major challenge in this task is imbalanced class distribution, which tends to reduce a model’s sensitivity to positive cases. This study uses a dataset of 1,000 patient medical records from the Mendeley Data repository, containing clinical attributes relevant to diabetes diagnosis, and examines how the number of neighbors K affects the K-Nearest Neighbors (KNN) algorithm when it is combined with the Synthetic Minority Oversampling Technique (SMOTE). The experiment uses 10-fold cross-validation and five evaluation metrics: accuracy, precision, recall, F1-score, and Area Under the Curve (AUC). In contrast to prior studies, SMOTE is applied within each fold of the cross-validation process, which prevents data leakage, preserves the integrity of the evaluation, and supports model generalizability. The K=3 configuration yields the highest F1-score, 95.13%, with a recall of 91.83%, while K=9 achieves the highest AUC, 96.40%, but with lower sensitivity. The model detects positive cases more effectively while maintaining high precision. These findings indicate that combining KNN with SMOTE and a proper validation strategy is a promising approach to building a reliable early detection system for Diabetes Mellitus that adapts to imbalanced clinical data.
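The fold-wise procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy two-feature data, every function name, the fixed choice of five SMOTE neighbors, and the pooling of predictions across folds are assumptions made for illustration, and only precision, recall, and F1 are computed. The point it shows is the ordering: synthetic minority samples are generated from the training rows of each fold only, so the held-out fold never influences oversampling.

```python
import math
import random
from collections import Counter


def make_toy_data(n_major=90, n_minor=30, seed=0):
    """Hypothetical 2-D imbalanced data; NOT the study's patient records."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n_major)]
    X += [[rng.gauss(2.5, 1.0), rng.gauss(2.5, 1.0)] for _ in range(n_minor)]
    return X, [0] * n_major + [1] * n_minor


def stratified_folds(y, n_folds, rng):
    """Deal the shuffled indices of each class round-robin into n_folds folds."""
    folds = [[] for _ in range(n_folds)]
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds


def nearest(points, query, k):
    """Indices of the k points closest to query (Euclidean distance)."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    return order[:k]


def smote(X_min, n_new, k, rng):
    """Interpolate n_new synthetic points between minority-class neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.randrange(len(X_min))
        others = [p for j, p in enumerate(X_min) if j != base]
        neighbour = others[rng.choice(nearest(others, X_min[base], k))]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(X_min[base], neighbour)])
    return synthetic


def knn_predict(X_train, y_train, query, k):
    """Majority vote among the k nearest training points."""
    votes = Counter(y_train[i] for i in nearest(X_train, query, k))
    return votes.most_common(1)[0][0]


def evaluate(X, y, k_neighbors=3, n_folds=10, seed=0):
    """Precision, recall and F1 for class 1, pooled over all held-out folds."""
    rng = random.Random(seed)
    tp = fp = fn = 0
    for test_idx in stratified_folds(y, n_folds, rng):
        held_out = set(test_idx)
        X_tr = [X[i] for i in range(len(X)) if i not in held_out]
        y_tr = [y[i] for i in range(len(y)) if i not in held_out]
        # Oversample from training rows only; the test fold stays untouched.
        X_min = [x for x, label in zip(X_tr, y_tr) if label == 1]
        n_new = max(0, len(y_tr) - 2 * len(X_min))  # majority minus minority
        new_pts = smote(X_min, n_new, k=5, rng=rng)  # k=5 is an assumption
        X_tr, y_tr = X_tr + new_pts, y_tr + [1] * len(new_pts)
        for i in test_idx:
            pred = knn_predict(X_tr, y_tr, X[i], k_neighbors)
            if pred == 1 and y[i] == 1:
                tp += 1
            elif pred == 1:
                fp += 1
            elif y[i] == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    X, y = make_toy_data()
    for k in (3, 5, 7, 9):
        print(k, evaluate(X, y, k_neighbors=k))
```

Calling `smote` on the whole dataset before splitting would let synthetic points derived from test rows enter the training set, which is the leakage the abstract warns against. In practice the same ordering is usually obtained by placing a SMOTE step and a KNN classifier in an imbalanced-learn `Pipeline` and passing that pipeline to scikit-learn's cross-validation utilities; the scores printed by this sketch come from toy data and say nothing about the paper's reported figures.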
License
Copyright (c) 2025 Imam Riadi, Anton Yudhana, Gusti Chandra Kurniawan

This work is licensed under a Creative Commons Attribution 4.0 International License.





