Qing-Dao-Er-Ji Ren, Yang Yang, Lele Wang
Mongolian speech synthesis converts Mongolian text into Mongolian speech. To improve the emotional expressiveness of synthesized speech, this article makes three contributions. First, it proposes WFST-MnG2P, a lightweight Mongolian phoneme pre-training model based on weighted finite-state transducers (WFSTs). Second, because Mongolian is a representative low-resource language with no open-source emotional speech corpus, it constructs a Mongolian emotional speech corpus covering seven discrete emotions and totaling about 2.25 hours. Third, because non-autoregressive acoustic models reduce errors such as skipped or missing words and repeated pronunciation while also speeding up synthesis, it proposes a Mongolian emotional speech synthesis model based on a conditional generative adversarial network and an improved FastSpeech2. Experimental results show that the synthesized emotional speech achieves an average mean opinion score (MOS) of 3.69 on the self-built Mongolian emotional speech corpus, and that the model synthesizes Mongolian emotional speech with rich, multi-dimensional emotion and improved robustness.
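As a rough illustration of how an average MOS such as the one reported above is typically computed, the following sketch averages listener ratings on a 1-5 scale and attaches a normal-approximation 95% confidence interval. The function name `mean_opinion_score` and the ratings are hypothetical and chosen only for illustration; they are not the article's evaluation data or protocol.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mean, 95% CI half-width) for a list of 1-5 listener ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mean = statistics.mean(ratings)
    # Normal-approximation interval: 1.96 times the standard error of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical ratings for one synthesized utterance set (not from the article).
ratings = [4, 3, 4, 5, 3, 4, 4, 3]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

In practice, ratings would usually be collected per emotion and per system, and the interval helps judge whether differences between systems are meaningful given the number of listeners.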
Ren Qing-dao-er-ji, Qian Bo, Chao Zhou, Yatu Ji, Nier Wu
Ba Zu, Rangzhuoma Cai, Zhijie Cai, Zhaxi Pengmao
Aihong Huang, Feilong Bao, Guanglai Gao, Shan Yu, Rui Liu