Air Quality Index Classification: Feature Selection for Improved Accuracy with Multinomial Logistic Regression

Authors

  • Rizky Caesar Irjayana, Master Program of Informatics, Ahmad Dahlan University, Indonesia
  • Abdul Fadlil, Department of Electrical Engineering, Ahmad Dahlan University, Indonesia
  • Rusydi Umar, Department of Information System, Ahmad Dahlan University, Indonesia

DOI:

https://doi.org/10.52436/1.jutif.2025.6.5.5155

Keywords:

Air quality index, Classification, Data mining, K-fold cross validation, Multinomial logistic regression

Abstract

Air pollution is a major public health concern, creating the need for accurate and interpretable Air Quality Index (AQI) classification models. This study aims to classify AQI into three categories—Good, Moderate, and Unhealthy—using Multinomial Logistic Regression (MLR) with feature selection. The dataset, obtained from public monitoring stations in Jakarta between 2021 and 2024, initially contained 4,620 daily records. After cleaning and outlier removal, 3,586 valid samples remained, from which 900 balanced records (300 per class) were selected for modeling. Key features included PM₁₀, PM₂.₅, SO₂, CO, O₃, and NO₂, which were standardized using Max Normalization to ensure uniform feature scaling. The classification process applied k-fold cross-validation (k = 2–5), and performance was assessed using accuracy and Macro F1-score. Results show that including PM₂.₅ improves performance by about 10%, with the best outcome at k = 5 (accuracy = 91.67%, Macro F1 = 91.45%). These findings confirm PM₂.₅ as a decisive feature for AQI prediction and demonstrate that MLR provides a lightweight, transparent, and computationally efficient solution. Beyond environmental health, the contribution of this work lies in advancing data-driven decision support systems in Informatics, particularly for real-time monitoring and policy applications.
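
The workflow summarized in the abstract (max normalization, k-fold cross-validation for k = 2 to 5, multinomial logistic regression, accuracy and Macro F1) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. It makes several assumptions: "Max Normalization" is read as dividing each feature by its column maximum, the folds are stratified by class, the classifier is a plain softmax regression trained by batch gradient descent, and the data are synthetic rather than the Jakarta records. All function names are hypothetical.

```python
import math
import random

def max_normalize(rows):
    """Divide each column by its largest absolute value (assumed reading of 'Max Normalization')."""
    maxima = [max(abs(v) for v in col) or 1.0 for col in zip(*rows)]
    return [[v / m for v, m in zip(row, maxima)] for row in rows]

def stratified_folds(y, k, seed=0):
    """Split sample indices into k folds, dealing each class out evenly (assumption)."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds

def softmax_probs(W, x):
    """Class probabilities for one sample; the last entry of each weight row is the bias."""
    z = [sum(w * v for w, v in zip(row, x)) + row[-1] for row in W]
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def fit_softmax(X, y, n_classes, epochs=200, lr=0.5):
    """Multinomial logistic regression fitted by batch gradient descent on cross-entropy."""
    n_features = len(X[0])
    W = [[0.0] * (n_features + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * (n_features + 1) for _ in range(n_classes)]
        for x, t in zip(X, y):
            p = softmax_probs(W, x)
            for c in range(n_classes):
                err = p[c] - (1.0 if c == t else 0.0)
                for j, v in enumerate(x):
                    grad[c][j] += err * v
                grad[c][-1] += err
        for c in range(n_classes):
            for j in range(n_features + 1):
                W[c][j] -= lr * grad[c][j] / len(X)
    return W

def predict(W, x):
    p = softmax_probs(W, x)
    return p.index(max(p))

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def cross_validate(X, y, k, n_classes=3):
    """Return mean accuracy and mean Macro F1 over k folds."""
    accs, f1s = [], []
    for test_idx in stratified_folds(y, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        W = fit_softmax([X[i] for i in train_idx], [y[i] for i in train_idx], n_classes)
        y_true = [y[i] for i in test_idx]
        y_pred = [predict(W, X[i]) for i in test_idx]
        accs.append(sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true))
        f1s.append(macro_f1(y_true, y_pred, range(n_classes)))
    return sum(accs) / k, sum(f1s) / k

def make_synthetic(n_per_class=20, n_features=6, seed=1):
    """Balanced synthetic stand-in for six pollutant features and three AQI classes."""
    rng = random.Random(seed)
    X, y = [], []
    for c in range(3):
        for _ in range(n_per_class):
            X.append([rng.gauss(20.0 + 30.0 * c, 5.0) for _ in range(n_features)])
            y.append(c)
    return X, y

if __name__ == "__main__":
    X_raw, y = make_synthetic()
    X = max_normalize(X_raw)
    for k in range(2, 6):
        acc, f1 = cross_validate(X, y, k)
        print(f"k={k}: accuracy={acc:.3f}, macro F1={f1:.3f}")
```

To mimic the feature-selection comparison described in the abstract, one could run `cross_validate` twice, once with and once without the column standing in for PM₂.₅, and compare the two score pairs. Note that this sketch normalizes before splitting, so fold maxima are shared between training and test data; computing the maxima on each training fold only would be the stricter choice.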

Published

2025-10-16

How to Cite

[1] R. C. Irjayana, A. Fadlil, and R. Umar, “Air Quality Index Classification: Feature Selection for Improved Accuracy with Multinomial Logistic Regression”, J. Tek. Inform. (JUTIF), vol. 6, no. 5, pp. 3265–3279, Oct. 2025.