OPTIMIZING ANDROID MALWARE DETECTION USING NEURAL NETWORKS AND FEATURE SELECTION METHOD
Abstract
Malware poses a serious threat to Android security. In recent years, Android malware has evolved rapidly, employing obfuscation techniques such as polymorphism and metamorphism. Signature-based detection cannot identify these modern variants. This study compares feature selection methods and machine learning algorithms to identify the most effective and efficient combination for classifying Android malware, using the Drebin dataset. Four classification algorithms are compared: Naive Bayes, Logistic Regression, Neural Network, and Random Forest. The best-performing algorithm is then evaluated in three scenarios: without feature selection, with Information Gain, and with Chi-Squared (χ²). In the latter two scenarios, the number of features was chosen using backward elimination. Both feature selection methods achieved the same performance, but Information Gain required fewer features. The evaluation metrics are AUC, accuracy, F1-score, training time, and testing time; training and testing time are measured because a more efficient model allows faster detection in real-world applications. The results show that the combination of Information Gain and the Neural Network achieves the highest performance, with an accuracy and F1-score of 98.6%, a training time of 81.135 seconds, and a testing time of 1.095 seconds. Compared to the Neural Network without feature selection, this combination reduces training time by 17.7597% and testing time by 57.9977% while maintaining the same performance. This research contributes to improving the speed and accuracy of malware detection systems, enhancing mobile security.
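To illustrate the two ranking criteria named in the abstract, the following is a minimal sketch of how Information Gain and the Chi-Squared statistic can score binary features against a malware/benign label. It is not the authors' implementation: the toy data, the feature names (`feat_a`, `feat_b`, `feat_c`), and the helper functions are hypothetical, and the backward elimination step and the Neural Network itself are omitted.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of the labels minus their weighted entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def chi_squared(feature, labels):
    """Pearson chi-squared statistic of the feature-versus-label contingency table."""
    total = len(labels)
    observed = Counter(zip(feature, labels))
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    stat = 0.0
    for f, nf in feature_counts.items():
        for c, nc in label_counts.items():
            expected = nf * nc / total
            stat += (observed[(f, c)] - expected) ** 2 / expected
    return stat


def rank_features(columns, labels, score):
    """Return feature names sorted from highest to lowest score."""
    return sorted(columns, key=lambda name: score(columns[name], labels), reverse=True)


# Hypothetical toy data: 1 = feature present in the app; label 1 = malware, 0 = benign.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "feat_a": [1, 1, 1, 1, 0, 0, 0, 0],  # matches the label exactly
    "feat_b": [1, 1, 1, 0, 1, 0, 0, 0],  # partly associated with the label
    "feat_c": [1, 1, 0, 0, 1, 1, 0, 0],  # independent of the label
}

ig_ranking = rank_features(columns, labels, information_gain)
chi_ranking = rank_features(columns, labels, chi_squared)

# Keep only the top-k ranked features before training a classifier (k is an assumption here).
top_k = ig_ranking[:2]
print(ig_ranking, chi_ranking, top_k)
```

On this toy data both criteria produce the same ordering. On a real dataset the two rankings can differ, so the number of features needed to reach a given performance may differ between them as well.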
Copyright (c) 2024 Jevan Bintoro, Fauzi Adi Rafrastara, Ines Aulia Latifah, Wildani Ghozi, Warusia Yassin

This work is licensed under a Creative Commons Attribution 4.0 International License.