In this paper, we optimize the radial basis function (RBF), a widely used kernel in support vector machines, as a case study to evaluate the potential of FPGAs and the capabilities of high-level synthesis (HLS) for data-intensive applications. We explain the HLS flow and use it to develop and evaluate kernels optimized with vectorization, loop unrolling, and a half-precision storage format. Our optimizations improve kernel performance by a factor of 15.8 over a baseline kernel on the Nallatech 385A FPGA card, which features an Intel Arria 10 GX 1150 FPGA. The half-precision storage format can reduce DSP and memory utilization at the cost of increased logic utilization. Compared to the single-precision floating-point kernels, the half-precision kernels can reduce dynamic power consumption on the FPGA by approximately 30%. In terms of energy efficiency, the performance per watt of the FPGA platform is approximately 3X higher than that of an Intel Xeon 16-core CPU and 1.8X higher than that of an Nvidia Tesla K80 GPU. The raw performance of the FPGA, however, is approximately 2X lower than that of the CPU and 2.7X lower than that of the GPU.
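For readers unfamiliar with the kernel, the sketch below illustrates the computation in plain Python: the RBF kernel K(x, y) = exp(-gamma * ||x - y||^2), with the squared-distance accumulation unrolled by a factor of four and the inputs optionally rounded to half precision for storage. It is not the authors' HLS kernel code; the function names `to_half` and `rbf`, the unroll factor, and the sample values are all assumptions chosen for illustration.

```python
import math
import struct


def to_half(value):
    """Round a float to the nearest IEEE 754 half-precision value.

    Hypothetical helper: models a half-precision *storage* format by
    packing to 16 bits and unpacking again; arithmetic stays in full
    precision.
    """
    return struct.unpack("<e", struct.pack("<e", value))[0]


def rbf(x, y, gamma):
    """Return exp(-gamma * ||x - y||^2) for two equal-length vectors.

    The squared distance is accumulated in four independent partial
    sums, mimicking a loop unrolled by 4 (an assumed factor).
    """
    n = len(x)
    main = n - n % 4                  # largest multiple of 4 not exceeding n
    partial = [0.0, 0.0, 0.0, 0.0]    # one accumulator per unrolled lane
    for i in range(0, main, 4):
        for lane in range(4):
            d = x[i + lane] - y[i + lane]
            partial[lane] += d * d
    tail = 0.0                        # leftover elements when n % 4 != 0
    for i in range(main, n):
        d = x[i] - y[i]
        tail += d * d
    return math.exp(-gamma * (sum(partial) + tail))


# Illustrative data only: compare full-precision and half-stored inputs.
x = [0.1, 0.2, 0.3, 0.4, 0.5]
y = [0.5, 0.4, 0.3, 0.2, 0.1]
full = rbf(x, y, 0.5)
half_stored = rbf([to_half(v) for v in x], [to_half(v) for v in y], 0.5)
```

In this toy setting the half-stored result differs from the full-precision one only by a small rounding error, which is the trade-off a half-precision storage format accepts in exchange for lower memory traffic.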
Zheming Jin, Iris Johnson, Hal Finkel
Y. Wang, Wendong Mao, Lang Feng, Jin Sha, Zhongfeng Wang