Mongolian Emotional Speech Synthesis Based on CGAN and Improved FastSpeech2

Qing-Dao-Er-Ji Ren; Yang Yang; Lele Wang

doi:10.1145/3749102

JOURNAL ARTICLE

Year: 2025; Journal: ACM Transactions on Asian and Low-Resource Language Information Processing; Vol: 24 (9); Pages: 1-16; Publisher: Association for Computing Machinery

Abstract

Mongolian speech synthesis converts Mongolian text into Mongolian speech. To improve the emotional expressiveness of synthesized speech, this article first proposes a lightweight Mongolian phoneme pre-training model, WFST-MnG2P, based on weighted finite-state transducers. Second, Mongolian is a representative low-resource language and currently has no open-source emotional speech corpus, so a Mongolian emotional speech corpus covering seven discrete emotions and totaling about 2.25 hours was constructed. Finally, because non-autoregressive acoustic models reduce errors such as skipped words, missing words, and repeated pronunciation while also speeding up synthesis, this article proposes a Mongolian emotional speech synthesis model based on a conditional generative adversarial network (CGAN) and an improved FastSpeech2. Experimental results show that the average MOS of emotional speech on the self-built Mongolian emotional speech corpus is 3.69, and that the model synthesizes Mongolian emotional speech with rich, multi-dimensional emotion and greater robustness.
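
The abstract reports an average MOS (mean opinion score) without describing the listening test. As a minimal sketch of how such a figure is usually obtained, the Python below averages 1-to-5 listener ratings per emotion and then averages across emotions. The ratings, the emotion labels, the function name `mean_opinion_score`, and the choice to average per-emotion means (rather than pooling all ratings) are assumptions made for illustration; they are not taken from the article and do not reproduce its 3.69 result.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (mos, half_width) for a list of 1-5 listener ratings.

    half_width is a 95% confidence half-width under a normal approximation.
    """
    if not ratings:
        raise ValueError("ratings must not be empty")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("MOS ratings must lie in [1, 5]")
    mos = mean(ratings)
    if len(ratings) > 1:
        half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    else:
        half_width = 0.0
    return mos, half_width


# Hypothetical ratings, invented for illustration only (not the paper's data).
ratings_by_emotion = {
    "happy": [4, 4, 3, 5],
    "sad": [3, 4, 4, 3],
}

# Per-emotion MOS, then an unweighted average across emotions (an assumption).
per_emotion = {e: mean_opinion_score(r)[0] for e, r in ratings_by_emotion.items()}
overall = mean(list(per_emotion.values()))

for emotion, score in per_emotion.items():
    print(f"{emotion}: {score:.2f}")
print(f"overall: {overall:.2f}")
```

Averaging per-emotion means gives each emotion equal weight regardless of how many utterances were rated for it; pooling all ratings would instead weight emotions by sample count.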

Keywords: Computer science; Speech recognition; Pronunciation; Speech synthesis; Speech corpus; Robustness (evolution); Natural language processing; Artificial intelligence; Linguistics

Topics

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

FastSpeech2 Based Japanese Emotional Speech Synthesis

SRC-IT2: Speech Rate-Controllable Mongolian Emotional Speech Synthesis Based on Improved Tacotron2

Research on Tibetan Speech Synthesis Based on Fastspeech2

Mongolian emotional speech synthesis based on transfer learning and emotional embedding

EmoSpeech: guiding FastSpeech2 towards Emotional Text to Speech