Learning to Detect and Classify Malicious Executables in the Wild

J. Zico Kolter; Marcus A. Maloof

JOURNAL ARTICLE

Year: 2006; Journal: Journal of Machine Learning Research; Vol: 7 (99); Pages: 2721-2744; Publisher: The MIT Press

Abstract

We describe the use of machine learning and data mining to detect and classify malicious executables as they appear in the wild. We gathered 1,971 benign and 1,651 malicious executables and encoded each as a training example using n-grams of byte codes as features. Such processing resulted in more than 255 million distinct n-grams. After selecting the most relevant n-grams for prediction, we evaluated a variety of inductive methods, including naive Bayes, decision trees, support vector machines, and boosting. Ultimately, boosted decision trees outperformed other methods with an area under the ROC curve of 0.996. Results suggest that our methodology will scale to larger collections of executables. We also evaluated how well the methods classified executables based on the function of their payload, such as opening a backdoor and mass-mailing. Areas under the ROC curve for detecting payload function were in the neighborhood of 0.9, which were smaller than those for the detection task. However, we attribute this drop in performance to fewer training examples and to the challenge of obtaining properly labeled examples, rather than to a failing of the methodology or to some inherent difficulty of the classification task. Finally, we applied detectors to 291 malicious executables discovered after we gathered our original collection, and boosted decision trees achieved a true-positive rate of 0.98 for a desired false-positive rate of 0.05. This result is particularly important, for it suggests that our methodology could be used as the basis for an operational system for detecting previously undiscovered malicious executables.
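The feature pipeline described in the abstract (byte n-grams, then selection of the most relevant ones) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not state the n-gram length, the relevance measure, or how many n-grams were kept, so `n=4`, information gain as the ranking criterion, and `k=500` are assumptions made here, and all function names are hypothetical.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4) -> set:
    """Return the set of distinct byte n-grams found in one executable."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def entropy(pos: int, neg: int) -> float:
    """Entropy in bits of a two-class split; an empty split counts as 0."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result


def information_gain(present_mal, present_ben, total_mal, total_ben):
    """Information gain of the boolean feature 'n-gram occurs in the file'."""
    total = total_mal + total_ben
    present = present_mal + present_ben
    absent_mal = total_mal - present_mal
    absent_ben = total_ben - present_ben
    conditional = (
        (present / total) * entropy(present_mal, present_ben)
        + ((total - present) / total) * entropy(absent_mal, absent_ben)
    )
    return entropy(total_mal, total_ben) - conditional


def select_ngrams(samples, labels, n=4, k=500):
    """Rank n-grams by information gain and keep the top k (assumed criterion)."""
    mal_counts, ben_counts = Counter(), Counter()
    for data, is_malicious in zip(samples, labels):
        # Count each n-gram at most once per file (presence, not frequency).
        (mal_counts if is_malicious else ben_counts).update(byte_ngrams(data, n))
    total_mal = sum(1 for label in labels if label)
    total_ben = len(labels) - total_mal
    scores = {
        gram: information_gain(mal_counts[gram], ben_counts[gram], total_mal, total_ben)
        for gram in set(mal_counts) | set(ben_counts)
    }
    # Sort by descending score; break ties on the bytes for a stable order.
    return sorted(scores, key=lambda gram: (-scores[gram], gram))[:k]


def encode(data: bytes, selected, n=4):
    """Encode one executable as a boolean vector over the selected n-grams."""
    present = byte_ngrams(data, n)
    return [gram in present for gram in selected]


# Toy byte strings standing in for executables (made-up data).
toy_samples = [b"\x90\x90\xeb\xfeAAAA", b"\x90\x90\xeb\xfeBBBB",
               b"hello world!", b"goodbye all."]
toy_labels = [True, True, False, False]  # True = malicious
toy_selected = select_ngrams(toy_samples, toy_labels, n=4, k=3)
toy_vector = encode(b"xx\x90\x90\xeb\xfeyy", toy_selected)
```

The resulting boolean vectors could then be passed to any of the learners the abstract names (naive Bayes, decision trees, support vector machines, boosting). Holding every distinct n-gram in memory, as this sketch does, would not be practical at the scale of hundreds of millions of n-grams reported above.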

Keywords:

Executable; Computer science; Machine learning; Artificial intelligence; Support vector machine; Naive Bayes classifier; Flagging; Decision tree; False positive paradox; False positive rate; Payload (computing); Boosting (machine learning); Classifier (UML); Function (biology); Data mining; Byte; Task (project management); Gradient boosting; Random forest; Operating system; Engineering

Metrics

Cited By: 572

FWCI (Field Weighted Citation Impact): 11.51

Citation Normalized Percentile: 0.99

Topics

Advanced Malware Detection Techniques

Physical Sciences → Computer Science → Signal Processing

Spam and Phishing Detection

Physical Sciences → Computer Science → Information Systems

Network Security and Intrusion Detection

Physical Sciences → Computer Science → Computer Networks and Communications

Related Documents

Learning to detect malicious executables in the wild

Learning to Detect Malicious Executables

A Hybrid Model to Detect Malicious Executables

Malicious Executables

Using Fuzzy Pattern Recognition to Detect Unknown Malicious Executables Code