Compiler fuzzing is a technique for testing the functionality of a compiler. It requires well-formed test cases (i.e., programs) that are lexically and syntactically correct, so that they pass the parsing stage of the compiler. Recent compiler fuzzing methods generate test cases with deep neural networks, which learn a language model of regular programs to ensure test case quality. However, most of these methods fail to capture long-distance syntactic dependencies (e.g., paired curly braces) in a program. As a result, they may generate test cases with syntax errors, which are rejected at the parsing stage and therefore never exercise the rest of the compiler. In this paper, we propose a framework, DSmith, that captures long-distance syntactic dependencies for robust test case generation. Specifically, DSmith memorizes the hidden state of each token in a program and leverages the interactions among these hidden states to embed the long-distance dependencies between tokens. It then adopts an encoder-decoder architecture that uses this embedding to build a language model of regular programs. Finally, DSmith uses the language model to generate test cases according to four novel generation strategies, which significantly increase the diversity of the test cases. Extensive experiments show that, compared with state-of-the-art methods, DSmith increases the parsing pass rate of the generated programs by an average of 19% and significantly improves the code coverage of the compiler. Benefiting from the high pass rate and broad code coverage, DSmith has found eleven new bugs in currently supported versions of GCC.
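To make the notions of a long-distance dependency and a parsing pass rate concrete, the following is a minimal sketch and not DSmith's implementation. It assumes a toy stand-in for a parser that checks only whether brackets are paired and nested correctly, and it ignores string literals and comments. The names `delimiters_balanced`, `pass_rate`, and `samples` are hypothetical and do not come from the paper.

```python
def delimiters_balanced(source: str) -> bool:
    """Return True if (), [] and {} in `source` are properly paired and nested.

    Toy stand-in for a parser: it checks only bracket pairing and does not
    skip string literals or comments (an assumption made for brevity).
    """
    pairs = {")": "(", "]": "[", "}": "{"}
    openers = set(pairs.values())
    stack = []
    for ch in source:
        if ch in openers:
            # The matching closer may appear arbitrarily far ahead.
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    # Any opener left on the stack was never closed.
    return not stack


def pass_rate(programs: list[str]) -> float:
    """Fraction of programs accepted by the toy check above."""
    if not programs:
        return 0.0
    passed = sum(delimiters_balanced(p) for p in programs)
    return passed / len(programs)


# Hypothetical generated test cases: two well-formed, two malformed.
samples = [
    "int main() { if (1) { return 0; } }",
    "int main() { if (1) { return 0; }",   # outer brace never closed
    "int f(int a[]) { return a[0]; }",
    "void g() { int x = (1 + 2; }",        # parenthesis never closed
]

print(pass_rate(samples))
```

In the second sample the unmatched brace is opened near the start of the program and should be closed only at the very end, which is the kind of dependency a model with a short effective context tends to miss. A real evaluation would replace the toy check with the compiler's own front end.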
Chris Cummins, Pavlos Petoumenos, Alastair Murray, Hugh Leather
Zheng Zhang, Rui Ma, Yuqi Zhai, Yuche Yang, Siqi Zhao, Hongming Chen
Kuiliang Lin, Xiangpu Song, Yingpei Zeng, Shanqing Guo