Predicting Expressive Speaking Style from Text in End-To-End Speech Synthesis

Daisy Stanton; Yuxuan Wang; RJ Skerry-Ryan

doi:10.1109/slt.2018.8639682

JOURNAL ARTICLE

Year: 2018 Pages: 595-602

Abstract

Global Style Tokens (GSTs) are a recently proposed method to learn latent disentangled representations of high-dimensional data. GSTs can be used within Tacotron, a state-of-the-art end-to-end text-to-speech synthesis system, to uncover expressive factors of variation in speaking style. In this work, we introduce the Text-Predicting Global Style Token (TP-GST) architecture, which treats GST combination weights or style embeddings as "virtual" speaking style labels within Tacotron. TP-GST learns to predict stylistic renderings from text alone, requiring neither explicit labels during training nor auxiliary inputs for inference. We show that, when trained on an expressive speech dataset, our system can render text with more pitch and energy variation than two state-of-the-art baseline models. We further demonstrate that TP-GSTs can synthesize speech with background noise removed, and corroborate these analyses with positive results on human-rated listener preference audiobook tasks. Finally, we demonstrate that multi-speaker TP-GST models successfully factorize speaker identity and speaking style. We provide a website with audio samples for each of our findings.
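The idea of treating GST combination weights as "virtual" style labels can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no layer sizes, predictor architecture, or loss, so the dimensions, the random linear text predictor (`text_to_logits`), the stand-in `reference_logits`, and the cross-entropy loss are all assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / exps.sum()


rng = np.random.default_rng(0)

# Hypothetical sizes; the abstract does not specify any of them.
NUM_TOKENS, TOKEN_DIM, TEXT_DIM = 4, 8, 6

# A bank of style tokens (learned in a real model, random here).
style_tokens = rng.normal(size=(NUM_TOKENS, TOKEN_DIM))

# Hypothetical linear predictor from a text summary vector to token logits.
text_to_logits = rng.normal(size=(TEXT_DIM, NUM_TOKENS))


def style_embedding(weights):
    """Style embedding as a weighted combination of the style tokens."""
    return weights @ style_tokens


def predict_weights_from_text(text_summary):
    """Predict combination weights from text features alone."""
    return softmax(text_summary @ text_to_logits)


def weight_prediction_loss(target_weights, predicted_weights):
    """Cross-entropy between target and predicted weights (assumed loss)."""
    return float(-np.sum(target_weights * np.log(predicted_weights + 1e-9)))


# Training: weights derived from reference audio act as the "virtual" label.
reference_logits = rng.normal(size=NUM_TOKENS)  # stand-in for an audio encoder
target_weights = softmax(reference_logits)

text_summary = rng.normal(size=TEXT_DIM)  # stand-in for a text encoder output
predicted_weights = predict_weights_from_text(text_summary)
loss = weight_prediction_loss(target_weights, predicted_weights)

# Inference: only text is available, so the predicted weights pick the style.
inferred_style = style_embedding(predicted_weights)
```

In this sketch the inference path touches only `text_summary`, which mirrors the abstract's claim that no explicit labels or auxiliary inputs are needed at synthesis time. Predicting the style embedding directly, the other option the abstract mentions, would replace the weight predictor with a regressor onto the output of `style_embedding`.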

Keywords:

End-to-end principle; Computer science; Style (visual arts); Speech synthesis; Speech recognition; Natural language processing; Linguistics; Artificial intelligence; Art; Literature; Philosophy

Metrics

Cited By: 114

FWCI (Field Weighted Citation Impact): 12.31

Citation Normalized Percentile: 0.98

Topics

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Speech and dialogue systems

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Learning Hierarchical Representations for Expressive Speaking Style in End-to-End Speech Synthesis

Expressive Text-to-Speech Synthesis using Text Chat Dataset with Speaking Style Information

CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis

Improving Unsupervised Style Transfer in end-to-end Speech Synthesis with end-to-end Speech Recognition

Exploiting Deep Sentential Context for Expressive End-to-End Speech Synthesis