JOURNAL ARTICLE

DSmith: Compiler Fuzzing through Generative Deep Learning Model with Attention

Abstract

Compiler fuzzing is a technique for testing the functionality of a compiler. It requires well-formed test cases (i.e., programs) whose lexicon and syntax are correct, so that they pass the compiler's parsing stage. Recent compiler fuzzing methods generate test cases with deep neural networks, which learn a language model of regular programs to ensure test case quality. However, most of these methods fail to capture long-distance syntactic dependencies (e.g., paired curly braces) in a program. As a result, they may generate test cases with syntax errors, which fail at the parsing stage and therefore never exercise the rest of the compiler. In this paper, we propose a framework, DSmith, that captures long-distance syntactic dependencies for robust test case generation. Specifically, DSmith memorizes the hidden state of each token in a program and uses the interactions among these hidden states to embed the long-distance dependencies between tokens. It then adopts an encoder-decoder architecture, augmented with this embedding, to build a language model of regular programs. Finally, DSmith uses the language model to generate test cases according to four novel generation strategies, which significantly increase test case diversity. Extensive experiments show that, compared with state-of-the-art methods, DSmith increases the parsing pass rate of generated programs by an average of 19% and significantly improves compiler code coverage. Benefiting from the high pass rate and broad code coverage, DSmith has found eleven new bugs in currently supported versions of GCC.
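The idea of storing one hidden state per token and letting the decoder interact with all of them can be illustrated with generic dot-product attention. The sketch below is only an assumption about the general mechanism: the abstract does not specify DSmith's scoring function, and the names `softmax`, `attend`, `hidden_states`, and `query`, as well as all vector values, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, hidden_states):
    """Generic dot-product attention (an assumed stand-in, not DSmith's exact model).

    query: the decoder state at the current generation step.
    hidden_states: one stored vector per previously seen token.
    Returns the attention weights and the weighted context vector.
    """
    # Score every stored hidden state against the current decoder state.
    scores = [sum(q * h for q, h in zip(query, state)) for state in hidden_states]
    weights = softmax(scores)
    # The context vector is the attention-weighted sum of the stored states.
    dim = len(query)
    context = [
        sum(w * state[i] for w, state in zip(weights, hidden_states))
        for i in range(dim)
    ]
    return weights, context


# Made-up hidden states for the tokens "{", "x", ";".
hidden_states = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.5]]
# A made-up decoder state that is most similar to the stored state for "{".
query = [2.0, 0.0]

weights, context = attend(query, hidden_states)
```

In this toy setup the largest weight falls on the stored state for `{`, however many tokens separate it from the current step. That is the property a model needs in order to emit the matching `}` at the right point.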

Keywords:
Fuzz testing; Computer science; Compiler; Parsing; Compiler construction; Compiler correctness; Programming language; Abstract syntax tree; Optimizing compiler; Code coverage; Syntax; Code generation; Artificial intelligence; Dead code elimination; Software; Operating system; Object code

Metrics

Cited By: 14
FWCI (Field Weighted Citation Impact): 2.02
Refs: 30
Citation Normalized Percentile: 0.85

Topics

Software Testing and Debugging Techniques
Physical Sciences →  Computer Science →  Software
Software Engineering Research
Physical Sciences →  Computer Science →  Information Systems
Software System Performance and Reliability
Physical Sciences →  Computer Science →  Computer Networks and Communications
