Fine-Grained Style Control In Transformer-Based Text-To-Speech Synthesis

Liwei Chen; Alexander I. Rudnicky

doi:10.1109/icassp43922.2022.9747747

JOURNAL ARTICLE

Year: 2022

Journal: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Pages: 7907-7911

Abstract

In this paper, we present a novel architecture for fine-grained style control in transformer-based text-to-speech synthesis (TransformerTTS). Specifically, we model the speaking style by extracting a time sequence of local style tokens (LST) from the reference speech. The existing content encoder in TransformerTTS is then replaced by our cross-attention blocks, which fuse and align content and style. Because the fusion is performed alongside the skip connection, the cross-attention block provides a good inductive bias for gradually infusing the phoneme representation with a given style. Additionally, we prevent the style embedding from encoding linguistic content by randomly truncating the LST during training and by using wav2vec 2.0 features. Experiments show that with fine-grained style control, our system performs better in terms of naturalness, intelligibility, and style transferability. Our code and samples are publicly available.
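The fusion idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single attention head with no learned projections, and the function names `cross_attention_with_skip` and `randomly_truncate`, along with the prefix-based truncation scheme, are hypothetical choices made only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_with_skip(phonemes, style_tokens):
    """Return phoneme + attention(phoneme, style) for each phoneme vector.

    Assumption: single head, no learned query/key/value projections.
    """
    dim = len(phonemes[0])
    fused = []
    for query in phonemes:
        # Scaled dot-product score between this phoneme and every style token.
        scores = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(dim)
                  for tok in style_tokens]
        weights = softmax(scores)
        # Weighted sum of style tokens: the style context aligned to this phoneme.
        context = [sum(w * tok[d] for w, tok in zip(weights, style_tokens))
                   for d in range(dim)]
        # Skip connection: the phoneme vector is kept and the style context is added.
        fused.append([q + c for q, c in zip(query, context)])
    return fused


def randomly_truncate(style_tokens, rng, min_len=1):
    """Keep a random-length prefix of the style sequence (hypothetical scheme)."""
    keep = rng.randint(min_len, len(style_tokens))
    return style_tokens[:keep]


# Toy data: three 2-dimensional phoneme vectors and four local style tokens.
rng = random.Random(0)
phonemes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
style = [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0], [1.0, -1.0]]
fused = cross_attention_with_skip(phonemes, randomly_truncate(style, rng))
```

Because the style context is added to the phoneme vector rather than replacing it, stacking several such blocks would blend style in incrementally, which is one way to read the "gradual infusion" described in the abstract.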

Keywords:

Computer science; Naturalness; Transformer; Encoder; Speech recognition; Intelligibility (philosophy); Style (visual arts); Natural language processing; Speech synthesis; Artificial intelligence; Engineering

Metrics

FWCI (Field Weighted Citation Impact): 3.06

Citation Normalized Percentile: 0.92

Topics

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Speech and dialogue systems

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Fine-Grained Style Control in VITS-Based Text-to-Speech Synthesis

Fine-Grained Prosody Transfer Text-to-Speech Synthesis with Transformer

RFGETT-TTS: Robust Fine-Grained Expressivity Transfer With Transformer for Text-to-Speech Synthesis

High-Acoustic Fidelity Text To Speech Synthesis With Fine-Grained Control Of Speech Attributes

Fine-Grained Style Modeling, Transfer and Prediction in Text-to-Speech Synthesis via Phone-Level Content-Style Disentanglement