The research presented in the paper addresses the problem of multilingual text-to-speech, in particular the synthesis of speech when the desired combination of properties (speaker, language, speaking style) is missing from the training corpus. The proposed model achieves cross-lingual speech synthesis through neural network embeddings, applied not only to speaker and speaking-style IDs but also to context-dependent phonemes and a range of prosodic events, including accents and phrase breaks. This allows the model to efficiently capture relationships between phones and prosodic events across languages and, consequently, to synthesize speech in the voice of a speaker who has never spoken the target language or used the target style. The model was trained on speech corpora of American English and Serbo-Croatian. A range of experiments, including subjective evaluation of the synthesized speech, was carried out to establish both the quality of synthesis in different scenarios and under different conditions, as well as the similarity of speaker voices between the cross-lingual and original-language scenarios.