JOURNAL ARTICLE

Cross-lingual Text-to-Speech with Prosody Embedding

Abstract

This paper addresses multilingual text-to-speech, in particular the synthesis of speech when the desired combination of properties (speaker, language, speaking style) is absent from the training corpus. The proposed model achieves cross-lingual speech synthesis through neural network embeddings, applied not only to speaker and speaking-style IDs but also to context-dependent phonemes and a range of prosodic events, including accents and phrase breaks. This allows the model to efficiently capture relationships between phones and prosodic events across languages, and consequently to synthesize speech in the voice of a speaker who has never spoken the target language or used the target style. The model was trained on speech corpora of American English and Serbo-Croatian. A series of experiments, including subjective evaluation, was carried out to assess both the quality of synthesis in different scenarios and under different conditions, and the similarity of speaker voices between the cross-lingual and original-language scenarios.
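The abstract describes mapping each symbolic input (speaker ID, speaking-style ID, context-dependent phoneme, prosodic event) through its own embedding table before it reaches the acoustic model. A minimal sketch of that input layer is shown below; all vocabulary sizes, dimensions, and names are hypothetical, since the abstract does not give the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary sizes and embedding dimensions; the paper's
# actual values are not stated in the abstract.
SIZES = {"speaker": 4, "style": 3, "phoneme": 50, "accent": 5, "break": 4}
DIMS  = {"speaker": 8, "style": 4, "phoneme": 16, "accent": 4, "break": 4}

# One (in practice trainable) lookup table per symbolic input: speaker ID,
# style ID, context-dependent phoneme, prosodic accent, phrase-break type.
tables = {k: rng.normal(size=(SIZES[k], DIMS[k])) for k in SIZES}

def frame_input(ids):
    """Concatenate the embedding vectors for one phone's symbolic IDs."""
    return np.concatenate([tables[k][ids[k]] for k in sorted(ids)])

# Cross-lingual request: a speaker recorded only in one language combined
# with a phoneme from the other language -- the lookup accepts any
# combination, which is what enables the cross-lingual scenario.
x = frame_input({"speaker": 2, "style": 1, "phoneme": 41,
                 "accent": 0, "break": 3})
print(x.shape)  # (36,)
```

Because every property is embedded independently, a speaker vector learned from English data can be paired with Serbo-Croatian phoneme and prosody vectors at synthesis time, even though that combination never occurred in training.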

Keywords:
Text-to-speech synthesis; Cross-lingual speech synthesis; Prosody; Speech corpus; Natural language processing; Artificial intelligence

Metrics

Cited By: 2
FWCI (Field Weighted Citation Impact): 0.51
Refs: 23
Citation Normalized Percentile: 0.65


Topics

Speech Recognition and Synthesis
Physical Sciences →  Computer Science →  Artificial Intelligence
Natural Language Processing Techniques
Physical Sciences →  Computer Science →  Artificial Intelligence
Speech and dialogue systems
Physical Sciences →  Computer Science →  Artificial Intelligence