COMPARISON OF FEATURE SELECTION TO PERFORMANCE IMPROVEMENT OF K-NEAREST NEIGHBOR ALGORITHM IN DATA CLASSIFICATION

  • Iswanto — Master's Program in Informatics Engineering (Program Studi S2 Teknik Informatika), Faculty of Computer Science and Information Technology, Universitas Sumatera Utara, Indonesia
  • Tulus — Master's Program in Informatics Engineering, Faculty of Computer Science and Information Technology, Universitas Sumatera Utara, Indonesia
  • Poltak — Master's Program in Informatics Engineering, Faculty of Computer Science and Information Technology, Universitas Sumatera Utara, Indonesia
Keywords: Gain Ratio, Gini Index, Information Gain, K-Nearest Neighbor

Abstract

One of the most widely used data classification methods is the K-Nearest Neighbor (K-NN) algorithm. In this method, a new data point is classified by computing its distance to the training data and selecting its K nearest neighbors; the class of the new point is then determined by a majority vote among those K neighbors. However, the performance of this method is still lower than that of other classification methods. The causes are the majority-vote system used to assign classes and the influence of less relevant features in the dataset. This study compares several feature selection methods to measure their effect on the performance of the K-NN algorithm in data classification. The feature selection methods examined are Information Gain, Gain Ratio, and the Gini Index. The methods were tested on the Water Quality dataset from the Kaggle repository to determine which feature selection method is most effective. The test results show that feature selection improves the performance of the K-NN algorithm: averaged over K=1 to K=15, accuracy increased by 1.17% with Information Gain, 0.69% with Gain Ratio, and 1.19% with the Gini Index. The highest accuracy on the Water Quality dataset is 89.66%, obtained at K=13 with the Information Gain feature selection method.
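The pipeline the abstract describes — score each feature by how much it reduces class entropy, then classify a new point by majority vote among its K nearest neighbors — can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the function names and toy data are invented for the example, and Gain Ratio or the Gini Index would simply replace `information_gain` with an analogous scoring function.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature_idx):
    """Entropy reduction from partitioning the data on one (discrete) feature."""
    n = len(rows)
    partitions = {}
    for row, y in zip(rows, labels):
        partitions.setdefault(row[feature_idx], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in partitions.values())
    return entropy(labels) - remainder

def knn_predict(train_rows, train_labels, query, k):
    """Majority vote among the k training points nearest to `query` (Euclidean)."""
    nearest = sorted((math.dist(row, query), y)
                     for row, y in zip(train_rows, train_labels))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

# Toy data: feature 0 separates the classes perfectly, feature 1 is noise.
rows   = [(0, 7), (0, 2), (1, 7), (1, 2)]
labels = [0, 0, 1, 1]
print(information_gain(rows, labels, 0))   # 1.0 (perfect split)
print(information_gain(rows, labels, 1))   # 0.0 (irrelevant feature)
print(knn_predict(rows, labels, (0, 5), 3))  # 0
```

In the study's setting, features with low scores would be dropped before running K-NN, which is how the reported accuracy gains over plain K-NN are obtained.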



Published
2022-12-26
How to Cite
[1]
I. Iswanto, T. Tulus, and P. Poltak, “COMPARISON OF FEATURE SELECTION TO PERFORMANCE IMPROVEMENT OF K-NEAREST NEIGHBOR ALGORITHM IN DATA CLASSIFICATION”, J. Tek. Inform. (JUTIF), vol. 3, no. 6, pp. 1709-1716, Dec. 2022.