This GitHub repository contains an implementation of a malware classification system using Convolutional Neural Networks (CNNs). The goal of this project is to develop a model that accurately classifies different types of malware from an image representation of the input executable.
The first implementation, malimg_classifier, is trained on 25 malware classes from the Malimg dataset.
A second implementation, combined_classifier, adds to the dataset a benign class extracted from legitimate PE samples in DikeDataset.
The full explanation of the experiments can be found in presentation.pdf.
Malware (malicious software) poses a significant threat to computer systems and networks worldwide. It is crucial to detect and classify malware accurately to prevent potential security breaches. This project focuses on leveraging the power of CNNs, a deep learning technique commonly used in computer vision tasks, to classify malware samples into different categories.
The Malimg dataset used for this project contains labeled samples of different types of malware. Samples are grouped into one directory per class, with the directory name indicating the malware class.
A benign subset is stored in a separate folder, uploaded to this repository as benign_data, while the Malimg dataset can be found here.
The dataset is organized in the following structure:
malimg_dataset/
├── class1/
│ ├── malware1.png
│ ├── malware2.png
│ ├── ...
├── class2/
│ ├── malware3.png
│ ├── malware4.png
│ ├── ...
├── ...
benign_data/
├── benign_imgs/
│ ├── sample1.png
│ ├── sample2.png
│ ├── ...
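Since the directory name carries the label, the layout above can be walked with a few lines of standard-library Python. This is a minimal sketch, not the repository's loader: the function names `list_labeled_images` and `class_counts` are hypothetical, and it assumes every sample is a `.png` file sitting directly inside its class directory.

```python
from collections import Counter
from pathlib import Path

# Assumption: samples are PNG files, as in the tree shown above.
IMAGE_SUFFIXES = {".png"}


def list_labeled_images(root):
    """Return (path, label) pairs, using each sub-directory name as the label."""
    root = Path(root)
    samples = []
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for image_path in sorted(class_dir.iterdir()):
            if image_path.suffix.lower() in IMAGE_SUFFIXES:
                samples.append((image_path, class_dir.name))
    return samples


def class_counts(samples):
    """Count how many samples each label has, to expose class imbalance."""
    return Counter(label for _, label in samples)
```

Calling `list_labeled_images("malimg_dataset")` would yield one pair per malware image. Applied to `benign_data`, the same function would label every sample `benign_imgs`, so a combined dataset would presumably need that label renamed.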
You can find the full code in utils/data_conversion.ipynb, integrated from here and here.
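The general idea behind this kind of conversion is to read an executable as raw bytes and treat each byte as one grayscale pixel. The sketch below illustrates that idea only and may differ from what the notebook does: the function names, the fixed default width of 256, and the zero padding of the last row are all assumptions, and it writes a binary PGM file instead of PNG so that it needs nothing beyond the standard library.

```python
import math


def bytes_to_gray_rows(data, width=256):
    """Reshape raw bytes into rows of grayscale pixel values (0-255).

    Assumption: a fixed row width; the last row is padded with zeros.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


def write_pgm(rows, path):
    """Write the rows as a binary PGM (P5) image, one byte per pixel."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    with open(path, "wb") as fh:
        fh.write(f"P5\n{width} {height}\n255\n".encode("ascii"))
        for row in rows:
            fh.write(bytes(row))
```

A file would be converted with `write_pgm(bytes_to_gray_rows(open(path, "rb").read()), "out.pgm")`, where `path` is a placeholder for the executable to convert.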
The CNN model architecture used in this project consists of several convolutional layers, followed by pooling layers and fully connected layers. The CNN workflow is the following:
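As a rough illustration of how a stack of convolutional and pooling layers shrinks an input image before the fully connected layers, the sketch below traces the side length of a square feature map. It is not the repository's architecture: the layer list, kernel sizes, and the 64-pixel input are hypothetical values chosen only to show the arithmetic.

```python
def conv2d_output_size(size, kernel, stride=1, padding=0):
    """Side length after a square convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output_size(size, window=2, stride=None):
    """Side length after square pooling; stride defaults to the window size."""
    stride = window if stride is None else stride
    return (size - window) // stride + 1


def trace_feature_map(input_size, layers):
    """Return the side length after each layer, starting with the input."""
    sizes = [input_size]
    for kind, params in layers:
        if kind == "conv":
            sizes.append(conv2d_output_size(sizes[-1], **params))
        elif kind == "pool":
            sizes.append(pool_output_size(sizes[-1], **params))
        else:
            raise ValueError(f"unknown layer kind: {kind}")
    return sizes


# Hypothetical stack: two conv/pool pairs, 3x3 kernels, 2x2 pooling.
HYPOTHETICAL_LAYERS = [
    ("conv", {"kernel": 3}),
    ("pool", {"window": 2}),
    ("conv", {"kernel": 3}),
    ("pool", {"window": 2}),
]
```

With these assumed layers, `trace_feature_map(64, HYPOTHETICAL_LAYERS)` gives the sizes 64, 62, 31, 29, 14, so the flattened input to the first fully connected layer would have 14 × 14 × (number of channels) values.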
Overall | precision | recall | f1-score | support |
---|---|---|---|---|
accuracy | | | 0.8666 | 2054.0 |
macro avg | 0.81705 | 0.88241 | 0.83163 | 2054.0 |
weighted avg | 0.86608 | 0.8666 | 0.85959 | 2054.0 |

class | precision | recall | f1-score | support |
---|---|---|---|---|
Adialer.C | 0.96 | 1.0 | 0.97959 | 24.0 |
Agent.FYI | 0.95833 | 1.0 | 0.97872 | 23.0 |
Allaple.A | 0.99313 | 0.98132 | 0.98719 | 589.0 |
Allaple.L | 1.0 | 0.99686 | 0.99843 | 318.0 |
Alueron.gen!J | 0.975 | 1.0 | 0.98734 | 39.0 |
Autorun.K | 0.11602 | 1.0 | 0.20792 | 21.0 |
Benign | 0.98658 | 0.75 | 0.85217 | 196.0 |
C2LOP.P | 0.39216 | 0.68966 | 0.5 | 29.0 |
C2LOP.gen!g | 0.63158 | 0.9 | 0.74227 | 40.0 |
Dialplatform.B | 1.0 | 0.97143 | 0.98551 | 35.0 |
Dontovo.A | 0.94118 | 1.0 | 0.9697 | 32.0 |
Fakerean | 0.98611 | 0.93421 | 0.95946 | 76.0 |
Instantaccess | 0.97727 | 1.0 | 0.98851 | 86.0 |
Lolyda.AA1 | 0.93333 | 1.0 | 0.96552 | 42.0 |
Lolyda.AA2 | 0.91892 | 0.94444 | 0.93151 | 36.0 |
Lolyda.AA3 | 0.88462 | 0.95833 | 0.92 | 24.0 |
Lolyda.AT | 0.9375 | 0.96774 | 0.95238 | 31.0 |
Malex.gen!J | 0.96154 | 0.92593 | 0.9434 | 27.0 |
Obfuscator.AD | 1.0 | 1.0 | 1.0 | 28.0 |
Rbot!gen | 0.88571 | 1.0 | 0.93939 | 31.0 |
Skintrim.N | 0.94118 | 1.0 | 0.9697 | 16.0 |
Swizzor.gen!E | 0.60714 | 0.68 | 0.64151 | 25.0 |
Swizzor.gen!I | 0.5 | 0.30769 | 0.38095 | 26.0 |
VB.AT | 0.89888 | 0.98765 | 0.94118 | 81.0 |
Wintrim.BX | 0.85714 | 0.94737 | 0.9 | 19.0 |
Yuner.A | 0.0 | 0.0 | 0.0 | 160.0 |
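The columns of the per-class table are linked by F1 = 2PR / (P + R), and the number of times a class was predicted can be recovered as recall × support / precision. The short sketch below checks these relations against rows of the table; the helper names are hypothetical and the values are copied from the table above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def predicted_count(precision, recall, support):
    """Estimate how many samples were assigned to a class.

    true positives = recall * support, and precision = TP / predicted.
    """
    true_positives = recall * support
    return round(true_positives / precision)


# Autorun.K row: precision 0.11602, recall 1.0, support 21
autorun_f1 = f1_score(0.11602, 1.0)
autorun_predicted = predicted_count(0.11602, 1.0, 21)

# Yuner.A row: precision 0.0, recall 0.0
yuner_f1 = f1_score(0.0, 0.0)
```

Under these formulas Autorun.K was predicted about 181 times for only 21 true samples. The 160 extra predictions equal the support of Yuner.A, whose recall is 0, which is consistent with Yuner.A samples being labeled as Autorun.K. The table alone does not prove this; a confusion matrix would be needed to confirm it.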
Gibert, D., Mateu, C., Planes, J. et al. Using convolutional neural networks for classification of malware represented as images.
Gibert, D., Mateu, C., Planes, J. The rise of machine learning for detection and classification of malware: Research developments, trends and challenges. Journal of Network and Computer Applications.
Yue, S., Wang, T. Imbalanced Malware Images Classification: a CNN based Approach.
Nataraj, L., Karthikeyan, S., Jacob, G., Manjunath, B. (2011). Malware Images: Visualization and Automatic Classification. doi: 10.1145/2016904.2016908.
Kalash, M., Rochan, M., Mohammed, N., Bruce, N. D. B., Wang, Y., Iqbal, F. (2018). Malware Classification with Deep Convolutional Neural Networks. 2018 9th IFIP International Conference on New Technologies, Mobility and Security (NTMS), Paris, France, pp. 1-5. doi: 10.1109/NTMS.2018.8328749.
Tuan, A. P., Phuong, A. T. H., Thanh, N. V., Van, T. N. (2018). Malware Detection PE-Based Analysis Using Deep Learning Algorithm Dataset. figshare. Dataset. https://figshare.com/articles/dataset/Malware_Detection_PE-Based_Analysis_Using_Deep_Learning_Algorithm_Dataset/6635642/1