Accented Text-to-Speech Synthesis With Limited Data

Xuehao Zhou; Mingyang Zhang; Yi Zhou; Zhizheng Wu; Haizhou Li

doi:10.1109/taslp.2024.3363414

JOURNAL ARTICLE

Year: 2024
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Vol: 32
Pages: 1699-1711
Publisher: Institute of Electrical and Electronics Engineers

Abstract

This paper presents an accented text-to-speech (TTS) synthesis framework with limited training data. We study two aspects of accent rendering: phonetic variation (phoneme differences) and prosodic variation (pitch pattern and phoneme duration). The proposed accented TTS framework consists of two models: an accented front-end for grapheme-to-phoneme (G2P) conversion and an accented acoustic model with integrated pitch and duration predictors for phoneme-to-Mel-spectrogram prediction. The accented front-end directly models the phonetic variation, while the accented acoustic model explicitly controls the prosodic variation. Specifically, both models are first pretrained on a large amount of data, then only the accent-related layers are fine-tuned on a limited amount of data for the target accent. In the experiments, speech data of three English accents, i.e., General American English, Irish English, and British English Received Pronunciation, are used for pretraining. The pretrained models are then fine-tuned separately on the Scottish and General Australian English accents. Both objective and subjective evaluation results show that the accented TTS front-end fine-tuned with a small accented phonetic lexicon (5k words) effectively handles the phonetic variation of accents, while the accented TTS acoustic model fine-tuned with a limited amount of accented speech data (approximately 3 minutes) effectively improves prosodic rendering, including pitch and duration. The overall accent modeling contributes to improved speech quality and accent similarity.
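The pretrain-then-fine-tune step described in the abstract (update only the accent-related layers, keep everything else fixed) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Param` class, the `accent.` naming prefix, and every parameter name below are hypothetical, chosen only to show the selection logic. The abstract does not say which layers count as accent-related.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """Toy stand-in for a model parameter (hypothetical, not a real framework type)."""
    name: str
    trainable: bool = True


# Assumption: accent-related layers are identifiable by a name prefix.
ACCENT_PREFIXES = ("accent.",)


def freeze_for_accent_finetuning(params, accent_prefixes=ACCENT_PREFIXES):
    """Leave only accent-related parameters trainable and freeze the rest."""
    for p in params:
        # str.startswith accepts a tuple of prefixes.
        p.trainable = p.name.startswith(accent_prefixes)
    return [p.name for p in params if p.trainable]


# Hypothetical parameter list for a pretrained acoustic model.
pretrained = [
    Param("text_encoder.layer0.weight"),
    Param("accent.embedding.weight"),
    Param("accent.pitch_adapter.weight"),
    Param("accent.duration_adapter.weight"),
    Param("mel_decoder.layer0.weight"),
]

trainable_names = freeze_for_accent_finetuning(pretrained)
print(trainable_names)
```

In a real framework the same idea would usually be expressed by disabling gradient updates for the frozen parameters. The point of the sketch is only that the small target-accent dataset updates a small subset of the model.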

Keywords:

Computer science; Speech synthesis; Natural language processing; Speech recognition; Linguistics; Artificial intelligence; Philosophy

Metrics

FWCI (Field Weighted Citation Impact): 7.03

Citation Normalized Percentile: 0.95

Topics

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Speech and dialogue systems

Physical Sciences → Computer Science → Artificial Intelligence

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder

Improving Accented Speech Recognition Using Data Augmentation Based on Unsupervised Text-to-Speech Synthesis

Controllable Accented Text-to-Speech Synthesis With Fine and Coarse-Grained Intensity Rendering

DNN Based Expressive Text-to-Speech with Limited Training Data

Explicit Intensity Control for Accented Text-to-speech