This paper presents the design and FPGA implementation of a convolutional neural network accelerator (CNNA). Two kinds of sparsity, zero-valued weights and zero-valued input feature maps, are exploited to save power. The design features a hierarchical memory organization to reduce external memory access. Bandwidth compression and decompression schemes are also proposed to reduce external memory bandwidth. The unified scratch memory can be reconfigured dynamically layer by layer to maximize memory utilization. The proposed CNNA is designed with Xilinx high-level synthesis (HLS) and implemented on the ZCU102 board. With a total of 2048 multiply-and-accumulate (MAC) units, the design delivers 1 TOPS of computing power when running at 250 MHz.
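The peak-throughput figure quoted above follows directly from the MAC count and clock rate: each MAC performs one multiply and one add per cycle, i.e. two operations. A minimal sanity-check sketch (the variable names are illustrative, not from the paper):

```python
# Peak-throughput check for the figures quoted in the abstract.
# Each MAC contributes 2 operations (multiply + accumulate) per cycle.
num_macs = 2048
clock_hz = 250e6          # 250 MHz
ops_per_mac_per_cycle = 2

peak_tops = num_macs * ops_per_mac_per_cycle * clock_hz / 1e12
print(f"Peak throughput: {peak_tops:.3f} TOPS")  # ~1.024 TOPS
```

The result, 1.024 TOPS, matches the roughly 1 TOPS claimed for the accelerator.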
YU Zijian, MA De, YAN Xiaolang, SHEN Juncheng