Sho Ko, Alexander Rucker, Yaqi Zhang, Paul Mure, Kunle Olukotun
A significant trend in machine learning is sparsifying the training of neural networks to reduce the amount of computation required. Algorithms such as the Sub-LInear Deep learning Engine (SLIDE) [2] use locality-sensitive hashing (LSH) to create sparsity. These sparse training algorithms were originally developed for multi-threaded multicore CPUs, and they have been neither well studied nor optimized for accelerator platforms such as GPUs and reconfigurable dataflow architectures (RDAs). In this paper, we study the different variants of the SLIDE algorithm and investigate accuracy-performance tradeoffs on CPUs, GPUs, and RDAs. The implementation targeting the RDA outperforms the GPU by 7.5×. On a limited-memory RDA, a smart caching algorithm that we propose improves performance further, running 2× faster than the baseline RDA. Furthermore, we achieve another 2× speedup by placing all of the weights on-chip using an RDA with enough memory. We believe our work will pave the way for the future development of both algorithms and hardware architectures for sparse training.
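To make the idea of LSH-induced sparsity concrete, the following is a minimal sketch, not the SLIDE implementation. It assumes a single hash table and SimHash (signed random projections) as the hash family. The function names (`make_planes`, `simhash`, `build_table`, `sparse_forward`) and all sizes are hypothetical choices for illustration only. Neurons are bucketed by the hash of their weight vectors, and a forward pass evaluates only the neurons that fall in the same bucket as the input.

```python
import random

def make_planes(num_bits, dim, rng):
    """Draw num_bits random hyperplane normals (Gaussian entries) for SimHash."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def simhash(vec, planes):
    """Signed random projection: one bit per hyperplane, packed into an int."""
    code = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        code = (code << 1) | (1 if dot >= 0.0 else 0)
    return code

def build_table(weights, planes):
    """Bucket neuron indices by the hash code of their weight vectors."""
    table = {}
    for idx, w in enumerate(weights):
        table.setdefault(simhash(w, planes), []).append(idx)
    return table

def sparse_forward(x, weights, table, planes):
    """Compute pre-activations only for neurons in the input's bucket."""
    active = table.get(simhash(x, planes), [])
    return {i: sum(w * v for w, v in zip(weights[i], x)) for i in active}

# Hypothetical sizes, chosen only to keep the example small.
rng = random.Random(0)
dim, num_neurons, num_bits = 16, 256, 4
weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_neurons)]
planes = make_planes(num_bits, dim, rng)
table = build_table(weights, planes)

x = list(weights[7])  # an input identical to neuron 7's weight vector
out = sparse_forward(x, weights, table, planes)
print(f"evaluated {len(out)} of {num_neurons} neurons")
```

Because vectors with a small angle between them are likely to share a SimHash code, the selected bucket tends to contain the neurons with large inner products against the input, so most of the layer can be skipped. A practical system would typically use several tables and rebuild them as weights change; those details are omitted here.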
Benjamin Letschert, Kshitij Kulshreshtha, Andrea Walther, Duc Cuong Nguyen, Assefaw H. Gebremedhin, Alex Pothen
Gerasimos Gerogiannis, Sriram Aananthakrishnan, Josep Torrellas, Ibrahim Hur
Weizhi Xu, Yintai Sun, Shengyu Fan, Hui Yu, Xin Fu
Seongwook Kim, Yong-Jun Kim, Gwangeun Byeon, Seokin Hong
Joshua L. Proctor, Steven L. Brunton, Bingni W. Brunton, J. Nathan Kutz