Cross-lingual personalized speech generation seeks to synthesize a target speaker's voice from only a few training samples spoken in a language different from the target language. A popular technique is to condition a speech synthesizer on a speaker embedding that characterizes the target speaker. Unfortunately, such a speaker embedding is usually affected by the language being spoken, which degrades speaker similarity in cross-lingual personalized speech generation. In this paper, we propose a novel speaker encoding mechanism that learns a language-agnostic speaker embedding to characterize speaker individuality. Specifically, we adopt an encoder-decoder architecture that disentangles language information from the speaker embedding via multi-task learning. We conduct experiments on both voice conversion and text-to-speech synthesis between English and Mandarin, both of which involve cross-lingual speech generation. All objective and subjective evaluations consistently confirm that the proposed speaker embedding is language-agnostic and thus improves cross-lingual personalized speech generation in terms of speaker similarity.
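To make the multi-task disentanglement idea concrete, the following is a minimal NumPy sketch of one common realization, not the paper's actual architecture: a toy speaker encoder produces an utterance-level embedding, a speaker-classification head is trained to succeed on it, and a language-classification head is trained adversarially (e.g. via gradient reversal) so the embedding carries no language cues. All dimensions, weights, and the loss weighting are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the paper.
FEAT_DIM, EMB_DIM, N_SPK, N_LANG = 40, 16, 8, 2

# Toy encoder (one linear layer + tanh) and two task heads.
W_enc = rng.standard_normal((FEAT_DIM, EMB_DIM)) * 0.1
W_spk = rng.standard_normal((EMB_DIM, N_SPK)) * 0.1
W_lang = rng.standard_normal((EMB_DIM, N_LANG)) * 0.1

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-9))

def speaker_embedding(frames):
    # frames: (T, FEAT_DIM) -> utterance-level embedding (EMB_DIM,)
    # Mean-pooling over frames yields a fixed-size speaker representation.
    return np.tanh(frames @ W_enc).mean(axis=0)

# A toy batch: 4 utterances, 50 frames each, from 4 speakers in 2 languages.
batch = rng.standard_normal((4, 50, FEAT_DIM))
spk_labels = np.array([0, 1, 2, 3])
lang_labels = np.array([0, 0, 1, 1])

emb = np.stack([speaker_embedding(u) for u in batch])  # (4, EMB_DIM)

# Multi-task heads share the same embedding.
spk_loss = cross_entropy(softmax(emb @ W_spk), spk_labels)
lang_loss = cross_entropy(softmax(emb @ W_lang), lang_labels)

# The speaker task is minimized; the language task is trained adversarially
# (in practice via a gradient-reversal layer) so that language information
# is pushed out of the embedding. The sign convention is illustrative only.
ADV_WEIGHT = 0.5
total_loss = spk_loss - ADV_WEIGHT * lang_loss
print(emb.shape, float(spk_loss) > 0.0)
```

In a real training loop the gradient-reversal trick lets a single backward pass update the language head to predict language while updating the encoder to defeat it, which is what drives the embedding toward language-agnosticity.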