Leila Ben Letaifa, Jean-Luc Rouas
Transformer-based models have achieved state-of-the-art performance in various areas of machine learning, including automatic speech recognition. However, their cost in terms of computational power, memory, or energy consumption can be exorbitant, hence the interest in compression techniques. Transformer models are mostly composed of attention and feedforward components. In this paper, we propose to reduce the size of a transformer model in an end-to-end speech recognition system by decreasing the number and precision of linear-layer parameters. Specifically, we first investigate the impact of weight pruning on system performance. We then consider model quantization. To further reduce the model size, we combine the pruning and quantization methods. Experiments carried out on several speech datasets in different languages show that the memory footprint can be reduced by up to 84% with an insignificant loss of accuracy.
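The abstract does not specify the toolchain or the exact pruning and quantization settings, so the following PyTorch sketch is only an illustration of the general recipe it describes: magnitude-based pruning of linear-layer weights followed by post-training 8-bit quantization. The toy feedforward block, the 50% sparsity ratio, and the use of dynamic quantization are assumptions for the example, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical stand-in for a transformer feedforward block; the paper's
# actual end-to-end ASR model is not reproduced here.
model = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512),
)

# Step 1: magnitude (L1) pruning -- zero out the smallest 50% of weights
# in every Linear module (the 50% ratio is illustrative, not the paper's).
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # bake the pruning mask into the weights

# Step 2: post-training dynamic quantization of the remaining linear-layer
# parameters to 8-bit integers.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Report the sparsity actually achieved in the pruned float model.
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        sparsity = (module.weight == 0).float().mean().item()
        print(f"{name}: {sparsity:.0%} of weights pruned")
```

Note that dynamic quantization alone shrinks storage by reducing weight precision; realizing additional savings from the pruned zeros requires a sparse storage format, which is one reason combining the two methods is studied rather than assumed to compose for free.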