SYSTEMATIC LITERATURE REVIEW OF THE CLASS IMBALANCE CHALLENGES IN MACHINE LEARNING

  • Rifqi Fitriadi, Computer Science Master's Study Program, Faculty of Information Technology, Universitas Budi Luhur, Indonesia
  • Deni Mahdiana, Information Systems Study Program, Faculty of Information Technology, Universitas Budi Luhur, Indonesia
Keywords: Class Imbalance, Handling Method, Machine Learning, Systematic Literature Review

Abstract

The rapid growth of data poses challenges for storing, managing, and analyzing it. Data that is left unprocessed and unanalyzed provides only limited benefit to its owner. In many cases, the data being analyzed is imbalanced. A natural example is financial fraud detection, where non-fraudulent transactions usually far outnumber fraudulent ones. Such imbalance can degrade the accuracy and overall performance of machine learning classification models: many classifiers tend to learn the more general patterns of the majority class and, as a result, may overlook patterns in the minority class. Numerous studies have addressed the problem of imbalanced data. The objective of this systematic literature review is to present the latest developments in the application cases, handling methods, and evaluation techniques for imbalanced data. The review identifies new methods and is expected to give researchers more options for handling imbalanced data properly, so that classification models can produce unbiased, accurate, and consistent results.
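
The effect described in the abstract can be illustrated with a minimal sketch. It is not taken from the reviewed studies. The label counts (990 legitimate, 10 fraudulent) and the helper names `majority_baseline`, `accuracy`, and `recall` are assumptions made for illustration only. The sketch shows how a classifier that only ever predicts the majority class can report high accuracy while detecting none of the minority class.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label (a 'model' that ignores all features)."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of true `positive` samples that were predicted as `positive`."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical imbalanced labels: 0 = legitimate, 1 = fraudulent.
y_true = [0] * 990 + [1] * 10

# Predict the majority class for every transaction.
predicted_label = majority_baseline(y_true)
y_pred = [predicted_label] * len(y_true)

acc = accuracy(y_true, y_pred)
fraud_recall = recall(y_true, y_pred, positive=1)

print(f"accuracy:     {acc:.2f}")
print(f"fraud recall: {fraud_recall:.2f}")
```

Under these assumed counts, the baseline scores well on accuracy yet has zero recall on the fraudulent class, which is one reason the choice of evaluation technique matters for imbalanced data.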

Published
2023-10-05
How to Cite
[1]
Rifqi Fitriadi and Deni Mahdiana, “SYSTEMATIC LITERATURE REVIEW OF THE CLASS IMBALANCE CHALLENGES IN MACHINE LEARNING”, J. Tek. Inform. (JUTIF), vol. 4, no. 5, pp. 1099-1107, Oct. 2023.