IMPROVING MALWARE DETECTION USING INFORMATION GAIN AND ENSEMBLE MACHINE LEARNING

  • Arsabilla Ramadhani, Informatics Engineering, Faculty of Computer Science, Universitas Dian Nuswantoro, Indonesia
  • Fauzi Adi Rafrastara, Informatics Engineering, Faculty of Computer Science, Universitas Dian Nuswantoro, Indonesia
  • Salma Rosyada, Informatics Engineering, Faculty of Computer Science, Universitas Dian Nuswantoro, Indonesia
  • Wildanil Ghozi, Informatics Engineering, Faculty of Computer Science, Universitas Dian Nuswantoro, Indonesia
  • Waleed Mahgoub Osman, Mathematics Department, College of Education, Sudan University of Science and Technology, Sudan
Keywords: Ensemble-based Algorithms, Gradient Boosting, Information Gain, Machine Learning, Malware Detection

Abstract

Malware attacks pose a serious threat to digital systems, potentially causing data and financial losses. The increasing complexity and diversity of malware attack techniques have rendered traditional detection methods ineffective, so AI-based approaches are needed to improve the accuracy and efficiency of malware detection, especially for modern malware that uses obfuscation techniques. This study addresses the issue by applying ensemble-based machine learning algorithms to enhance malware detection accuracy. The methodology employs Random Forest, Gradient Boosting, XGBoost, and AdaBoost, with feature selection performed using Information Gain. Datasets from VirusTotal and VxHeaven were used, containing both goodware and malware samples. The results show that Gradient Boosting, strengthened with Information Gain, achieved the highest accuracy of 99.1%, indicating a significant improvement in malware detection effectiveness. This study demonstrates that applying Information Gain to Gradient Boosting can improve malware detection accuracy while reducing computational requirements, contributing significantly to the optimization of digital security systems.
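The pipeline described in the abstract — Information Gain feature selection followed by an ensemble classifier — can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses scikit-learn's `mutual_info_classif` (mutual information, the standard Information Gain estimator in that library) with `SelectKBest`, a synthetic dataset as a stand-in for the VirusTotal/VxHeaven feature matrix, and an arbitrary choice of `k=15` retained features; all hyperparameters here are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a goodware/malware feature matrix
# (the paper's actual datasets come from VirusTotal and VxHeaven).
X, y = make_classification(n_samples=1000, n_features=50,
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Information Gain feature selection: keep the k features with the
# highest mutual information with the class label.
selector = SelectKBest(mutual_info_classif, k=15)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

# Gradient Boosting on the reduced feature set.
clf = GradientBoostingClassifier(random_state=42)
clf.fit(X_train_sel, y_train)
acc = accuracy_score(y_test, clf.predict(X_test_sel))
print(f"accuracy on held-out split: {acc:.3f}")
```

Reducing the feature set before training is also what drives the computational saving the abstract mentions: the boosting stages fit their trees over 15 candidate features instead of 50.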



Published
2024-12-09
How to Cite
A. Ramadhani, F. A. Rafrastara, S. Rosyada, W. Ghozi, and W. M. Osman, “IMPROVING MALWARE DETECTION USING INFORMATION GAIN AND ENSEMBLE MACHINE LEARNING”, J. Tek. Inform. (JUTIF), vol. 5, no. 6, pp. 1673-1686, Dec. 2024.
