Towards Optimization of Malware Detection using Extra-Tree and Random Forest Feature Selections on Ensemble Classifiers
Fadare Oluwaseun Gbenga1, Adetunmbi Adebayo Olusola2, Oyinloye Oghenerukevwe Elohor3

1Fadare Oluwaseun Gbenga, Department of Computer Science, Ekiti State University, Nigeria.
2Adetunmbi Adebayo Olusola, Professor, Department of Computer Science, Federal University of Technology, Akure.
3Oyinloye Oghenerukevwe Elohor, Department of Computer Science, Ekiti State University, Nigeria.

Manuscript received on March 16, 2021. | Revised Manuscript received on March 25, 2021. | Manuscript published on March 30, 2021. | PP: 223-232 | Volume-9 Issue-6, March 2021. | Retrieval Number: 100.1/ijrte.F5545039621 | DOI: 10.35940/ijrte.F5545.039621
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: The proliferation of malware on computer communication systems poses great security challenges to confidential stored data and other valuable assets across the globe. Several attempts have been made to curb the menace using signature-based approaches and, more recently, machine learning techniques have been extensively explored. This paper proposes a framework that combines feature selection (FS) based on extra-tree and random forest with eight ensemble techniques built on five base learners: K-Nearest Neighbors (KNN), Naive Bayes, SVM, Decision Trees, and Logistic Regression. Among the base learners, KNN returns the highest accuracy: 96.48%, 96.40%, and 87.89% with extra-tree FS, random forest FS, and without feature selection (WFS) respectively. Among the ensembles, Random Forest achieves the highest accuracy under both feature selections, with 98.50% on random forest FS and 98.16% on extra-tree FS. On random forest FS, the Extreme Gradient Boosting Classifier is next with an accuracy of 98.37%, while Voting returns the lowest detection accuracy of 95.80%. On extra-tree FS, Bagging is next with a detection accuracy of 98.09%, while Voting returns the lowest accuracy of 95.54%. The Random Forest ensemble scores highest on all seven evaluation measures under both extra-tree and random forest feature selection. The results show that tree-based ensemble models are effective for malware classification.
Keywords: Extra-tree, random forest, K-Nearest Neighbors, Extreme Gradient Boosting Classifier, Random forest ensemble.
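
The two-stage framework summarized in the abstract (tree-based feature selection followed by an ensemble classifier) can be sketched roughly as below. This is an illustrative sketch only, not the authors' implementation: it assumes scikit-learn, uses synthetic data in place of the malware dataset, and assumes features are kept when their importance is at or above the mean importance (the default behaviour of `SelectFromModel` for tree models). The helper name `build_pipeline` and all hyperparameters are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def build_pipeline(selector_cls, seed=0):
    """Hypothetical helper: tree-based feature selection, then a random forest ensemble."""
    # Assumption: keep features whose importance is >= the mean importance.
    selector = SelectFromModel(selector_cls(n_estimators=100, random_state=seed))
    classifier = RandomForestClassifier(n_estimators=100, random_state=seed)
    return make_pipeline(selector, classifier)


# Synthetic stand-in data (an assumption; not the dataset used in the paper).
X, y = make_classification(
    n_samples=600, n_features=40, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

results = {}
selectors = [
    ("extra-tree", ExtraTreesClassifier),
    ("random forest", RandomForestClassifier),
]
for name, selector_cls in selectors:
    model = build_pipeline(selector_cls).fit(X_train, y_train)
    kept = int(model[0].get_support().sum())  # number of features retained
    accuracy = model.score(X_test, y_test)    # held-out accuracy
    results[name] = (kept, accuracy)
    print(f"{name} FS: kept {kept} of {X.shape[1]} features, accuracy {accuracy:.3f}")
```

Swapping `RandomForestClassifier` in the final step for another ensemble (for example bagging, boosting, or voting over the five base learners) would give the other comparisons the abstract mentions; the accuracies printed here come from synthetic data and are not comparable to the reported results.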