A Random Forest and SMOTE-Based Machine Learning Model for Predicting Recurrence in Papillary Thyroid Carcinoma

Edi Jaya Kusuma; Ririn Nurmandhani; Ika Pantiawati; Yusthin Meriantti Manglapy; Evina Widianawati

doi:10.52436/1.jutif.2025.6.4.4854

Authors

Edi Jaya Kusuma Faculty of Health Science, Universitas Dian Nuswantoro, Indonesia
Ririn Nurmandhani Faculty of Health Science, Universitas Dian Nuswantoro, Indonesia
Ika Pantiawati Faculty of Health Science, Universitas Dian Nuswantoro, Indonesia
Yusthin Meriantti Manglapy Faculty of Health Science, Universitas Dian Nuswantoro, Indonesia
Evina Widianawati Student of Department of Biomedical Engineering, Chung Yuan Christian University, Taoyuan City, Taiwan

Keywords:

Class Imbalance, Clinical Decision Support, Machine Learning, Papillary Thyroid Carcinoma, SMOTE

Abstract

Papillary thyroid carcinoma (PTC) is the most common subtype of thyroid cancer. Although its prognosis is typically favorable, recurrence remains a key challenge that requires early detection. This study proposes machine learning models to predict PTC recurrence while explicitly addressing the class imbalance inherent in recurrence data. Three supervised learning algorithms, namely Random Forest (RF), Extreme Gradient Boosting (XGB), and Support Vector Machine (SVM), were combined with the Synthetic Minority Oversampling Technique (SMOTE) to balance the dataset. SMOTE was chosen for its capacity to generate synthetic minority-class samples while minimizing information loss, thus addressing class imbalance and improving classification outcomes. Model performance was assessed using accuracy, precision, recall (sensitivity), and F1-score. Among all approaches tested, RF with SMOTE performed best, achieving an accuracy of 0.98, perfect precision (1.0), high recall (0.95), and a strong F1-score (0.97), and outperforming previous methods, including SMOTEENN-based approaches. The results show that SMOTE outperforms SMOTEENN in this clinical context, likely because it better preserves subtle prognostic indicators with minimal information loss. This improvement suggests that SMOTE preserves valuable decision-boundary information while addressing class imbalance in PTC recurrence prediction. These findings establish RF with SMOTE as a robust and well-balanced approach for predicting PTC recurrence, contributing to the development of more precise and responsive AI-driven decision support tools for thyroid cancer.
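
The oversampling idea and the reported metrics can be illustrated with a minimal sketch. This is not the authors' code and not the implementation found in any particular library. The function names `smote_sketch` and `f1_score` and the toy 2-D points are hypothetical and were chosen only for illustration. The sketch shows the core SMOTE step (interpolating between a minority sample and one of its nearest minority-class neighbours) and checks that the reported F1-score is consistent with the reported precision and recall.

```python
import random
from math import dist


def smote_sketch(minority, n_new, k=5, seed=0):
    """Simplified SMOTE-style oversampling (illustrative assumption only).

    Each synthetic point lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of `base`, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Made-up 2-D minority-class points, standing in for recurrence cases
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(minority, n_new=6, k=3)

# Consistency check on the values reported in the abstract
f1 = f1_score(precision=1.0, recall=0.95)
print(round(f1, 2))  # 0.97
```

Because every synthetic point is interpolated between existing minority samples, no original sample is discarded, which matches the abstract's description of oversampling with minimal information loss. The F1 check gives 2 × 1.0 × 0.95 / 1.95 ≈ 0.974, in line with the reported 0.97.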
