Grammar-based fuzzing is known to be an effective technique for detecting security vulnerabilities in programs, such as parsers, that take complex structured inputs. Unfortunately, most existing grammar-based fuzzers require considerable manual effort to write complex input grammars, which hinders their practical use. To address this problem, recently proposed approaches use machine learning to automatically acquire a generative model for structured inputs conforming to a complex grammar. Even these approaches, however, have major limitations: they fail to learn a generative model for instruction sequences, and they cannot achieve good coverage of instruction-parsing code. To overcome these limitations, this paper proposes a collection of techniques for enhancing learning-assisted grammar-based fuzzing. Our approach enables learning a generative model for instruction sequences by training a hybrid character/token-level recurrent neural network. In addition, we exploit coverage metrics gathered during previous fuzzing runs to efficiently refine (or fine-tune) the learnt model so that it generates new inputs that induce high coverage. Our experiments with a real PDF parser show that our approach succeeds in generating new sequences of instructions (in PDF page streams) that induce better code coverage (of the PDF parser) than state-of-the-art learning-assisted grammar-based fuzzers.
Yuma Jitsunari, Yoshitaka Arahori, Katsuhiko Gondow