Deep learning (DL) compilers have recently been widely developed to optimize the deployment of DL models. These compilers transform a DL model into a high-level intermediate representation (IR), lower it to a low-level IR, and ultimately generate optimized code for different hardware targets. However, DL compilers are not immune to generating incorrect code, which can have severe consequences. Testing techniques for low-level IR are limited, and efficient approaches for detecting some categories of non-crashing bugs are lacking. In this paper, we address the limitations of existing low-level IR testing techniques for DL compilers and introduce DeepDiffer, a priority-guided differential testing framework designed to detect bugs caused by low-level optimizations in a DL compiler, specifically TVM. We propose a novel DL compiler coverage metric and establish an optimization goal that maximizes the detection of valuable differences between DL compilers. Our experiments demonstrate that DeepDiffer outperforms existing low-level IR fuzzers and detects a wider range of bug types. DeepDiffer has identified 13 bugs in TVM, which fall into 9 distinct root causes; 9 of these bugs were previously unreported. We have submitted these bugs to the TVM community, where they have been confirmed.
Chris Cummins, Pavlos Petoumenos, Alastair Murray, Hugh Leather
Zhen Zhao, Xiangpu Song, Qiuyu Zhong, Yingpei Zeng, Chengyu Hu, Shanqing Guo
Qingchao Shen, Haoyang Ma, Junjie Chen, Yongqiang Tian, Shing-Chi Cheung, Xiang Chen