Improving Extreme Gradient Boosting Model for Heart Disease Prediction Using SMOTE for Class Imbalance

Dini Rohmayani; Castaka Agus Sugianto; Rangga Satria Perdana; Mohammed Mansoor Nafea

doi:10.52436/1.jutif.2025.6.4.4753

Authors

Dini Rohmayani Informatics Engineering, Politeknik TEDC, Indonesia
Castaka Agus Sugianto Informatics Engineering, Politeknik TEDC, Indonesia
Rangga Satria Perdana Information Systems, Universitas Sangga Buana, Indonesia
Mohammed Mansoor Nafea Computer Engineering Techniques, College of Technical Engineering, University of Al Maarif, Iraq

Keywords:

Heart Disease, Machine Learning, SMOTE, Streamlit, XGBoost

Abstract

This study develops a predictive model that classifies the severity of heart disease. The model combines XGBoost with oversampling to address class imbalance and is deployed through an interactive interface for real-world use. The study uses the UCI Heart Disease dataset, which contains a range of clinical features. Preprocessing consists of handling missing values, removing features with a substantial fraction of missing values, and applying SMOTE resampling so that the model learns from class-balanced data. XGBoost serves as the main classifier, and the dataset is split 80:20 into training and test sets. To support real-time prediction for individual patients, the model is deployed with Streamlit. The XGBoost model achieved 92% accuracy, with precision, recall, and F1-score also at 92%. These results exceed those reported in recent comparable studies that used other classifiers, and the Streamlit deployment makes the model more practical for clinical use. The novelty of the research lies in combining SMOTE with XGBoost, which enables effective classification under imbalanced conditions, together with a real-time Streamlit implementation for user-level predictions. The model is valuable for early identification and severity stratification of heart disease in clinical decision support settings.
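
The SMOTE step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which library or settings were used, and in practice a library implementation such as imbalanced-learn's `SMOTE` would be the more likely choice. The function names `smote` and `balance`, the neighbour count `k=5`, and the toy data are all assumptions made for illustration. The sketch shows the core idea only: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Illustrative sketch only; names and defaults are assumptions.
    """
    if n_new > 0 and len(minority) < 2:
        raise ValueError("need at least two samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of the base point within the same class
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance(X, y, k=5, seed=0):
    """Oversample every class up to the size of the largest class."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        extra = smote(rows, target - len(rows), k=k, seed=seed)
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out


# Toy data (hypothetical): six samples of class 0, two of class 1.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.1, 0.1], [0.2, 0.3],
     [1.0, 1.0], [1.2, 1.1]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = balance(X, y)
print(y_bal.count(0), y_bal.count(1))  # 6 6
```

Because every synthetic point is an interpolation of two real minority samples, the added rows stay inside the region the minority class already occupies. A common precaution, assumed here rather than stated in the abstract, is to resample only the training portion of an 80:20 split so that the test set contains real observations only.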
