Optimized KNN Performance with PCA and K-Fold Cross-Validation for Colorectal Cancer Survival Prediction

Authors

  • Yuke Manza, Computer Science, Universitas Potensi Utama, Indonesia
  • Rika Rosnelly, Computer Science, Universitas Potensi Utama, Indonesia
  • Mhd Furqan, Computer Science, Universitas Potensi Utama, Indonesia
  • Bob Subhan Reza, Computer Science, Universitas Potensi Utama, Indonesia

DOI:

https://doi.org/10.52436/1.jutif.2026.7.1.5422

Keywords:

Colorectal cancer prediction, K-fold cross validation, K-Nearest Neighbors, Machine learning, Principal Component Analysis, Survival prediction

Abstract

Colorectal cancer remains a leading cause of mortality worldwide, which makes effective tools for predicting patient survival necessary. Machine learning algorithms such as K-Nearest Neighbors (KNN) can predict survival from patient data, but standard KNN implementations often suffer from the curse of dimensionality and overfitting, leading to unreliable performance on complex medical datasets. This study evaluates and optimizes the KNN algorithm by integrating Principal Component Analysis (PCA) for dimensionality reduction and K-Fold Cross-Validation (KFCV) to enhance model stability. A quantitative approach was applied to a global colorectal cancer dataset, with demographic and clinical features processed through a pipeline of imputation, encoding, and normalization. Three model configurations were compared systematically: Standard KNN, KNN combined with PCA, and an optimized KNN model using both PCA and KFCV across various neighbor values. The results show a distinct trade-off between predictive sensitivity and model stability. The Standard KNN and PCA-enhanced models achieved higher recall, indicating a strong ability to identify survivors in a single data split, whereas the fully optimized KNN+PCA+KFCV model provided the most stable and generalizable accuracy, with minimal deviation. These findings indicate that, while PCA reduces computational complexity with little information loss, cross-validation is crucial for obtaining an honest assessment of model performance. This research contributes to clinical informatics by highlighting the need to weigh high sensitivity against generalization stability when developing survival prediction models for complex medical data with non-separable classes.
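
The evaluation idea described in the abstract, scoring KNN over several neighbor values with K-fold cross-validation and reporting the spread across folds, can be sketched as follows. This is a minimal illustration in standard-library Python, not the authors' implementation: the synthetic data, the function names (`knn_predict`, `k_fold_scores`, `make_toy_data`), the choice of five folds, and the neighbor values are all assumptions, and the imputation, encoding, normalization, and PCA steps are omitted for brevity.

```python
import math
import random
import statistics
from collections import Counter


def knn_predict(train_X, train_y, query, n_neighbors):
    """Majority vote among the n_neighbors closest training points (Euclidean)."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in nearest[:n_neighbors])
    return votes.most_common(1)[0][0]


def accuracy(train_X, train_y, test_X, test_y, n_neighbors):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train_X, train_y, x, n_neighbors) == y
               for x, y in zip(test_X, test_y))
    return hits / len(test_y)


def k_fold_scores(X, y, n_neighbors, n_folds=5, seed=0):
    """Return one accuracy per fold; every sample is held out exactly once."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    folds = [order[i::n_folds] for i in range(n_folds)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in order if i not in held]
        scores.append(accuracy([X[i] for i in train_idx],
                               [y[i] for i in train_idx],
                               [X[i] for i in held_out],
                               [y[i] for i in held_out],
                               n_neighbors))
    return scores


def make_toy_data(n=200, seed=42):
    """Synthetic, overlapping two-class data (hypothetical, NOT the study's dataset)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(float(label), 1.0) for _ in range(4)])
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    # Assumed neighbor values; the abstract only says "various neighbor values".
    for n_neighbors in (3, 5, 7, 9):
        scores = k_fold_scores(X, y, n_neighbors)
        print(n_neighbors,
              round(statistics.mean(scores), 3),
              round(statistics.stdev(scores), 3))
```

The per-fold standard deviation is what a single train/test split cannot provide. In a full pipeline, normalization and PCA would normally be fitted on the training folds only, so that the held-out fold does not leak into preprocessing.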

Published

2026-02-15

How to Cite

Y. Manza, R. Rosnelly, M. Furqan, and B. S. Reza, “Optimized KNN Performance with PCA and K-Fold Cross-Validation for Colorectal Cancer Survival Prediction,” J. Tek. Inform. (JUTIF), vol. 7, no. 1, pp. 361–372, Feb. 2026.