Huu-Kim Nguyen, Kihyuk Jeong, Hong-Goo Kang
In this paper, we present a fast and lightweight speech synthesis model suitable for on-device applications. By leveraging long-short range attention, depth-wise separable convolution, and linear attention, we significantly reduce the model size and complexity of the baseline FastSpeech2-based Transformer framework. Unlike the baseline model, whose attention and convolution operations require O(N²) computations because of nested loops, our proposed model requires only O(N) computations by restructuring the nested loop into two cascaded single loops. Experimental results show that our proposed model generates speech with a real-time factor of 0.26 while requiring only 10.4 million parameters. Despite the reduction in model size and complexity, the quality of the generated speech remains close to that of the baseline.
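A minimal sketch of two of the ingredients named in the abstract, written in PyTorch; it is not the authors' implementation. It contrasts standard O(N²) softmax attention with a kernelized linear-attention variant whose two cascaded matrix products are O(N) in sequence length, and shows a depth-wise separable 1-D convolution as a lower-parameter replacement for a dense convolution. The elu(x)+1 feature map, module names, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def softmax_attention(q, k, v):
    # Standard attention: the (N x N) score matrix makes this O(N^2) in time and memory.
    scores = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return scores @ v


def linear_attention(q, k, v, eps=1e-6):
    # Kernelized attention (assumed feature map: elu(x) + 1): reordering the products
    # as phi(q) @ (phi(k)^T v) replaces the N x N matrix with a d x d summary,
    # so cost grows linearly with sequence length N.
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = k.transpose(-2, -1) @ v                                    # (batch, d, d) summary
    z = 1.0 / (q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps)  # normalizer
    return (q @ kv) * z


class DepthwiseSeparableConv1d(nn.Module):
    """Depthwise + pointwise convolution: fewer parameters than a dense Conv1d."""

    def __init__(self, channels, kernel_size):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, 1)

    def forward(self, x):  # x: (batch, channels, time)
        return self.pointwise(self.depthwise(x))


if __name__ == "__main__":
    q = k = v = torch.randn(2, 128, 64)              # (batch, time, dim)
    print(softmax_attention(q, k, v).shape)          # torch.Size([2, 128, 64])
    print(linear_attention(q, k, v).shape)           # torch.Size([2, 128, 64])
    conv = DepthwiseSeparableConv1d(64, kernel_size=9)
    print(conv(torch.randn(2, 64, 128)).shape)       # torch.Size([2, 64, 128])
```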
Ba Zu, Rangzhuoma Cai, Zhijie Cai, Zhaxi Pengmao
Dengfeng Ke, Ruixin Hu, Qi Luo, Liangjie Huang, Wenhan Yao, Wentao Shu, Jinsong Zhang, Yanlu Xie
Qing-Dao-Er-Ji Ren, Yang Yang, Lele Wang