Gastroesophageal Reflux Disease Early Detection using XGBoost Method Classifier

Untari Novia Wisesty; Haura Adzkia Delfina; Isman Kurniawan

doi:10.52436/1.jutif.2025.6.2.4143

Authors

Untari Novia Wisesty, Informatics, School of Computing, Telkom University, Indonesia
Haura Adzkia Delfina, Data Science, School of Computing, Telkom University, Indonesia
Isman Kurniawan, Informatics, School of Computing, Telkom University, Indonesia


Keywords:

GERD Detection, Machine Learning, PCA, Pearson Correlation, SMOTE, XGBoost

Abstract

Gastroesophageal reflux disease (GERD) is a clinical condition in which gastric contents rise from the stomach into the esophagus. If left untreated, GERD can lead to complications such as esophageal inflammation, ulcers, and even cancer. This study performs early detection of GERD using a GERD dataset obtained from the Harvard Dataverse online repository and an XGBoost machine learning model. SMOTE was applied to address the class imbalance in the dataset. In addition, Pearson correlation was used for feature selection and Principal Component Analysis (PCA) for feature extraction, with the aim of improving computational efficiency. The best model performance was obtained with 16 attributes for Pearson correlation-based selection and with 16 principal components for PCA. The XGBoost model with PCA achieved a macro-average F1-score of 0.9615, the XGBoost model with Pearson correlation achieved 0.9809, and the XGBoost model trained on the original dataset achieved 0.9568. In this study, Pearson correlation-based feature selection therefore gave the highest macro F1-score for early detection of GERD, exceeding PCA-based feature extraction by 0.0194 and the original dataset by 0.0241.
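To make two of the steps named in the abstract concrete, the following is a minimal sketch in plain Python of Pearson correlation-based feature ranking and of the macro-average F1-score. It is not the authors' implementation: the function names (`pearson`, `select_top_k`, `macro_f1`) and the toy data are hypothetical, and SMOTE, PCA, and the XGBoost model itself are left out. The last lines only recheck the score differences reported in the abstract.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def select_top_k(columns, labels, k):
    """Return the names of the k features with the largest |r| against the label."""
    scores = {name: abs(pearson(values, labels)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical toy data: three features and a binary label (not the GERD dataset).
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "feature_a": [1, 2, 3, 7, 8, 9],  # rises with the label
    "feature_b": [5, 1, 4, 2, 6, 3],  # nearly unrelated to the label
    "feature_c": [9, 8, 7, 3, 2, 1],  # falls as the label rises
}
selected = select_top_k(columns, labels, k=2)

# Macro F1 on a hypothetical set of predictions.
score = macro_f1([0, 0, 1, 1], [0, 1, 1, 1])

# Differences between the macro F1-scores reported in the abstract.
diff_vs_pca = round(0.9809 - 0.9615, 4)
diff_vs_original = round(0.9809 - 0.9568, 4)
```

Ranking by the absolute value of the coefficient keeps strongly negative predictors as well as strongly positive ones. This is a design choice of the sketch, since the abstract does not state the selection criterion the study used.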

