Preventing Data Leakage in Classification via Integrated Machine Learning Pipelines: Preprocessing, Feature Transformation, and Hyperparameter Tuning

Authors

  • Arief Ichwani, Teknik Informatika, Institut Teknologi Sumatera, Indonesia
  • Rahman Indra Kesuma, Teknik Informatika, Institut Teknologi Sumatera, Indonesia
  • Andika Setiawan, Teknik Informatika, Institut Teknologi Sumatera, Indonesia
  • Imam Eko Wicaksono, Teknik Informatika, Institut Teknologi Sumatera, Indonesia
  • Raidah Hanifah, Teknik Informatika, Institut Teknologi Sumatera, Indonesia

DOI:

https://doi.org/10.52436/1.jutif.2026.7.1.5490

Keywords:

Classification pipeline, Data leakage prevention, Feature transformation, K-nearest neighbors, Machine learning preprocessing

Abstract

Data leakage in machine learning classification often leads to overfitting, inflated performance estimates, and poor reproducibility, undermining the reliability of deployed models and causing costly failures in industrial settings. This paper addresses the leakage problem by proposing an integrated machine learning pipeline that strictly isolates training from evaluation across the preprocessing, feature transformation, and model optimization stages. Experiments are conducted on the Titanic passenger survival dataset: exploratory data analysis identifies data quality issues, followed by stratified train-test splitting to preserve the class distribution. All preprocessing steps, including missing-value imputation, categorical encoding, and feature scaling, are fitted exclusively on the training data using a ColumnTransformer embedded within a unified Pipeline. A K-Nearest Neighbors (KNN) classifier is employed, with hyperparameters optimized via GridSearchCV and 3-fold cross-validation. Experimental results show that a baseline model without leakage control achieves only 72.62% test accuracy and exhibits a substantial overfitting gap. In contrast, the proposed pipeline-based approach improves generalization, achieving 78.21% test accuracy with an optimal configuration of k = 29 and Manhattan distance while markedly reducing overfitting. The main contribution of this work is a reproducible, leakage-aware pipeline guideline that ensures unbiased evaluation and reliable generalization in classification tasks, offering practical methodological guidance for both academic research and real-world machine learning applications.
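The leakage-aware workflow the abstract describes (stratified splitting, preprocessing fitted only on training data inside a ColumnTransformer/Pipeline, and KNN tuning via GridSearchCV with 3-fold cross-validation) can be sketched as follows. This is a minimal illustration, not the authors' exact implementation: the synthetic Titanic-like columns (`age`, `fare`, `sex`, `embarked`) and the hyperparameter grid are assumptions for demonstration only.

```python
# Sketch of a leakage-aware classification pipeline in scikit-learn.
# The data here is synthetic and Titanic-like; column names are assumed.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": np.where(rng.random(n) < 0.2, np.nan, rng.uniform(1, 80, n)),
    "fare": rng.exponential(30.0, n),
    "sex": rng.choice(["male", "female"], n),
    "embarked": rng.choice(["S", "C", "Q"], n),
})
y = (rng.random(n) < 0.4).astype(int)

# Stratified split preserves the class distribution in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

numeric = ["age", "fare"]
categorical = ["sex", "embarked"]

# Imputation, encoding, and scaling are declared here but only *fitted*
# when the enclosing pipeline is fitted, so their statistics (medians,
# means, category sets) come from training rows alone.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]),
     categorical),
])

pipe = Pipeline([("prep", preprocess), ("knn", KNeighborsClassifier())])

# GridSearchCV re-fits the entire pipeline inside every CV fold, so the
# preprocessors never see the fold's validation rows — this is what
# prevents preprocessing leakage during hyperparameter tuning.
grid = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": [5, 15, 29],
                "knn__metric": ["manhattan", "euclidean"]},
    cv=3,
)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.score(X_test, y_test), 4))
```

The key design point is that the transformers live *inside* the pipeline that the grid search cross-validates; calling `fit` on the preprocessed full dataset first, then splitting, would leak test-set statistics into imputation and scaling.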




Published

2026-02-15

How to Cite

A. Ichwani, R. I. Kesuma, A. Setiawan, I. E. Wicaksono, and R. Hanifah, “Preventing Data Leakage in Classification via Integrated Machine Learning Pipelines: Preprocessing, Feature Transformation, and Hyperparameter Tuning”, J. Tek. Inform. (JUTIF), vol. 7, no. 1, pp. 391–410, Feb. 2026.