Improving Unsupervised Style Transfer in end-to-end Speech Synthesis with end-to-end Speech Recognition

Da-Rong Liu; Chi-Yu Yang; Szu-Lin Wu; Hung-yi Lee

doi:10.1109/slt.2018.8639672

JOURNAL ARTICLE

Year: 2018 Pages: 640-647

Abstract

An end-to-end TTS model can directly take an utterance as a reference and generate speech from text with prosody and speaker characteristics similar to those of the reference utterance. Ideally, the transcription of the reference utterance does not need to match the text to be synthesized, so unsupervised style transfer can be achieved. However, because the previous model was trained only on matched text and speech, giving it unmatched text and speech at test time makes it synthesize blurry speech. In this paper, we propose to mitigate the problem by using unmatched text and speech during training, and by using the accuracy of an end-to-end ASR model to guide the training procedure. The experimental results show that, with the guidance of end-to-end ASR, both the ASR accuracy (objective evaluation) and the listener preference (subjective evaluation) of the speech generated by the TTS model improve. Moreover, we propose an attention consistency loss as regularization, which is shown to accelerate training.
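As a rough illustration of the ASR-guided objective described in the abstract, the sketch below scores speech synthesized from an unmatched text/reference pair by how well an ASR model recovers the input text, then adds that score to the usual reconstruction loss on matched pairs. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the `asr_weight` parameter, the token-level cross-entropy form, and the one-posterior-per-token alignment are all hypothetical. The attention consistency loss is not defined in the abstract and is left out.

```python
import math


def asr_cross_entropy(asr_posteriors, target_tokens):
    """Mean negative log-likelihood of the input text under the ASR output.

    asr_posteriors: list of probability distributions (lists of floats),
        one per target token. This alignment is an assumption made for
        illustration only.
    target_tokens: list of token indices for the text fed to the TTS model.
    """
    if len(asr_posteriors) != len(target_tokens):
        raise ValueError("expected one posterior distribution per target token")
    eps = 1e-12  # guards against log(0)
    nll = [
        -math.log(max(dist[tok], eps))
        for dist, tok in zip(asr_posteriors, target_tokens)
    ]
    return sum(nll) / len(nll)


def training_loss(recon_loss_matched, asr_loss_unmatched, asr_weight=1.0):
    """Hypothetical combined objective.

    recon_loss_matched: reconstruction loss on a matched text/speech pair.
    asr_loss_unmatched: ASR loss on speech synthesized from an unmatched pair.
    asr_weight: assumed scalar weight on the ASR term.
    """
    return recon_loss_matched + asr_weight * asr_loss_unmatched


# Toy example with a two-symbol vocabulary and the target text [0, 1].
intelligible = [[0.9, 0.1], [0.2, 0.8]]  # ASR is confident in the right tokens
blurry = [[0.5, 0.5], [0.5, 0.5]]        # ASR cannot tell the tokens apart
targets = [0, 1]

loss_intelligible = asr_cross_entropy(intelligible, targets)
loss_blurry = asr_cross_entropy(blurry, targets)
```

In this toy case the blurry output receives the larger ASR loss, so minimizing the combined objective pushes the TTS model toward speech that the recognizer can transcribe as the intended text.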

Keywords:

End-to-end principle; Computer science; Utterance; Speech recognition; Prosody; Consistency (knowledge bases); Speech synthesis; Artificial intelligence; Acoustic model; Natural language processing; Speech processing

Metrics

FWCI (Field Weighted Citation Impact): 2.78

Citation Normalized Percentile: 0.91

Topics

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing
