JOURNAL ARTICLE

Learning to Detect and Classify Malicious Executables in the Wild

J. Zico Kolter, Marcus A. Maloof

Year: 2006   Journal: Journal of Machine Learning Research   Vol: 7 (99)   Pages: 2721-2744   Publisher: The MIT Press

Abstract

We describe the use of machine learning and data mining to detect and classify malicious executables as they appear in the wild. We gathered 1,971 benign and 1,651 malicious executables and encoded each as a training example using n-grams of byte codes as features. Such processing resulted in more than 255 million distinct n-grams. After selecting the most relevant n-grams for prediction, we evaluated a variety of inductive methods, including naive Bayes, decision trees, support vector machines, and boosting. Ultimately, boosted decision trees outperformed other methods with an area under the ROC curve of 0.996. Results suggest that our methodology will scale to larger collections of executables. We also evaluated how well the methods classified executables based on the function of their payload, such as opening a backdoor and mass-mailing. Areas under the ROC curve for detecting payload function were in the neighborhood of 0.9, which were smaller than those for the detection task. However, we attribute this drop in performance to fewer training examples and to the challenge of obtaining properly labeled examples, rather than to a failing of the methodology or to some inherent difficulty of the classification task. Finally, we applied detectors to 291 malicious executables discovered after we gathered our original collection, and boosted decision trees achieved a true-positive rate of 0.98 for a desired false-positive rate of 0.05. This result is particularly important, for it suggests that our methodology could be used as the basis for an operational system for detecting previously undiscovered malicious executables.
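The feature pipeline the abstract describes (byte n-grams, relevance-based selection, one Boolean feature per retained n-gram) can be sketched as follows. This is a minimal illustration, not the authors' code: the abstract does not name the relevance measure, so information gain is used here as one common choice, and the function names, `n = 4`, and `k = 500` are illustrative assumptions rather than values stated in the abstract.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4) -> set[bytes]:
    """Return the distinct overlapping n-byte sequences found in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def entropy(pos: int, neg: int) -> float:
    """Binary class entropy in bits; an empty partition contributes 0."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def select_ngrams(samples: list[tuple[bytes, bool]], n: int = 4, k: int = 500) -> list[bytes]:
    """Rank n-grams by information gain (an assumed relevance score) and keep the top k.

    samples holds (executable bytes, is_malicious) pairs.
    """
    total = len(samples)
    n_mal = sum(1 for _, is_mal in samples if is_mal)
    n_ben = total - n_mal

    # Count, per class, how many executables contain each n-gram.
    with_mal, with_ben = Counter(), Counter()
    for data, is_mal in samples:
        (with_mal if is_mal else with_ben).update(byte_ngrams(data, n))

    base = entropy(n_mal, n_ben)

    def gain(gram: bytes) -> float:
        pm, pb = with_mal[gram], with_ben[gram]   # executables containing gram
        am, ab = n_mal - pm, n_ben - pb           # executables lacking gram
        conditional = ((pm + pb) / total) * entropy(pm, pb) \
                    + ((am + ab) / total) * entropy(am, ab)
        return base - conditional

    grams = set(with_mal) | set(with_ben)
    # Ties are broken by byte order so the result is deterministic.
    return sorted(grams, key=lambda g: (-gain(g), g))[:k]


def encode(data: bytes, vocabulary: list[bytes], n: int = 4) -> list[bool]:
    """Encode one executable as presence/absence of each selected n-gram."""
    present = byte_ngrams(data, n)
    return [gram in present for gram in vocabulary]
```

The resulting Boolean vectors could then be passed to any of the learners the abstract lists, such as boosted decision trees. Holding every distinct n-gram in memory, as this sketch does, would not be practical at the scale of the hundreds of millions of n-grams the abstract reports.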

Keywords:
Executable; Computer science; Machine learning; Artificial intelligence; Support vector machine; Naive Bayes classifier; Flagging; Decision tree; False positive paradox; False positive rate; Payload (computing); Boosting (machine learning); Classifier (UML); Function (biology); Data mining; Byte; Task (project management); Gradient boosting; Random forest; Operating system; Engineering

Metrics

Cited By: 572
FWCI (Field Weighted Citation Impact): 11.51
Refs: 39
Citation Normalized Percentile: 0.99
Is in top 1%
Is in top 10%

Topics

Advanced Malware Detection Techniques
Physical Sciences →  Computer Science →  Signal Processing
Spam and Phishing Detection
Physical Sciences →  Computer Science →  Information Systems
Network Security and Intrusion Detection
Physical Sciences →  Computer Science →  Computer Networks and Communications

Related Documents

BOOK-CHAPTER

Learning to Detect Malicious Executables

J. Zico Kolter, Marcus A. Maloof

Advanced information and knowledge processing Year: 2006 Pages: 47-63
BOOK-CHAPTER

Malicious Executables

Auerbach Publications eBooks Year: 2011 Pages: 111-118
BOOK-CHAPTER

Using Fuzzy Pattern Recognition to Detect Unknown Malicious Executables Code

Boyun Zhang, Jianping Yin, Jingbo Hao

Lecture notes in computer science Year: 2005 Pages: 629-634