Malware Classification Using Dynamically Extracted API Call Embeddings

Sahil Aggarwal; Fabio Di Troia

doi:10.3390/app14135731

JOURNAL ARTICLE

Year: 2024; Journal: Applied Sciences; Vol: 14 (13); Pages: 5731-5731; Publisher: Multidisciplinary Digital Publishing Institute

Abstract

Malware classification, the task of sorting malware samples into distinct groups, is a crucial element of robust computer security. Machine learning has recently emerged as a suitable approach to this problem: models can be trained on diverse malware features, such as opcodes and API calls, to extract information useful for classification. In natural language processing, word embeddings play a pivotal role by representing words as vectors in which similar words lie close together, which makes it possible to quantify word similarity. This research conducts a series of experiments using hybrid machine learning techniques. We derive word vectors from the dynamic API call logs of malware and use them as features for a range of classifiers. We generate embeddings from the API call logs using Hidden Markov Models and Word2Vec, and we also use BERT and ELMo, which are known for producing contextualized embeddings. The resulting vectors are fed to our classifiers, namely Support Vector Machines (SVMs), Random Forest (RF), k-Nearest Neighbors (kNN), and Convolutional Neural Networks (CNNs). In two distinct sets of experiments, we classify malware by family and by category. The results show that API call embeddings are an effective tool for malware classification, particularly for identifying malware families. The best combination was RF with word embeddings generated by Word2Vec, ELMo, and BERT, which achieved accuracies between 0.91 and 0.93. This result underscores the potential of our approach for classifying malware.
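The pipeline the abstract describes (API call logs, then per-call embeddings, then one feature vector per sample, then a classifier) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a simple co-occurrence count vector as a stand-in for Word2Vec, mean pooling to get one vector per sample, and a cosine-similarity kNN classifier. The API call sequences, the labels `family_a` and `family_b`, and all function names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy API call logs (one sequence per sample); not data from the paper.
TRAIN = [
    (["CreateFileW", "WriteFile", "CloseHandle", "CreateFileW", "WriteFile"], "family_a"),
    (["WriteFile", "CloseHandle", "CreateFileW", "WriteFile", "CloseHandle"], "family_a"),
    (["RegOpenKeyExW", "RegSetValueExW", "RegCloseKey", "RegOpenKeyExW"], "family_b"),
    (["RegSetValueExW", "RegCloseKey", "RegOpenKeyExW", "RegSetValueExW"], "family_b"),
]


def build_embeddings(sequences, window=2):
    """Count-based stand-in for Word2Vec (an assumption of this sketch):
    each API call is represented by how often every vocabulary item
    appears within `window` positions of it."""
    vocab = sorted({call for seq in sequences for call in seq})
    counts = {call: Counter() for call in vocab}
    for seq in sequences:
        for i, call in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[call][seq[j]] += 1
    return {call: [counts[call][other] for other in vocab] for call in vocab}


def embed_sequence(seq, embeddings):
    """Mean-pool the vectors of known calls into one fixed-length feature vector."""
    dim = len(next(iter(embeddings.values())))
    known = [embeddings[call] for call in seq if call in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_predict(seq, train_vectors, embeddings, k=1):
    """Label a sequence by majority vote among its k most similar training samples."""
    query = embed_sequence(seq, embeddings)
    ranked = sorted(train_vectors, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


embeddings = build_embeddings([seq for seq, _ in TRAIN])
train_vectors = [(embed_sequence(seq, embeddings), label) for seq, label in TRAIN]
```

Calling `knn_predict(["CreateFileW", "WriteFile", "CloseHandle"], train_vectors, embeddings)` classifies an unseen sequence against the toy training set. In practice the count vectors would be replaced by learned embeddings and the kNN step by any of the classifiers named in the abstract.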

Keywords:

Word2vec; Malware; Computer science; Artificial intelligence; Random forest; Support vector machine; Machine learning; Convolutional neural network; Natural language processing; Word (group theory); Opcode; Programming language

Metrics

FWCI (Field Weighted Citation Impact): 4.28

Citation Normalized Percentile: 0.90

Topics

Advanced Malware Detection Techniques

Physical Sciences → Computer Science → Signal Processing

Spam and Phishing Detection

Physical Sciences → Computer Science → Information Systems

Network Security and Intrusion Detection

Physical Sciences → Computer Science → Computer Networks and Communications

Related Documents

Malware Classification using API Call Information and Word Embeddings

Malware detection framework based on graph variational autoencoder extracted embeddings from API-call graphs

Malware Classification Method Using API Call Categorization

Malware Classification using Opcode N-grams and Word Embeddings

Deep learning for effective Android malware detection using API call graph embeddings