Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech

Hyungchan Yoon; Seyun Um; Changhwan Kim; Hong-Goo Kang

doi:10.21437/interspeech.2023-1571

Year: 2023 Pages: 3023-3027

Abstract

To simplify the generation process, several text-to-speech (TTS) systems implicitly learn intermediate latent representations instead of relying on predefined features (e.g., mel-spectrograms). However, their generation quality is unsatisfactory because these representations lack speech variances. In this paper, we improve TTS performance by adding prosody embeddings to the latent representations. During training, we extract reference prosody embeddings from mel-spectrograms; during inference, we estimate these embeddings from text using generative adversarial networks (GANs). The prosody embeddings have complex distributions due to the dynamic nature of speech, and GANs let us estimate them reliably and quickly. We also show that the prosody embeddings serve as efficient features for learning a robust alignment between text and acoustic features. In comparative experiments, our proposed model surpasses several publicly available models while using fewer parameters and less computation.
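
The training and inference paths described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: the function names, the 4-dimensional embeddings, the mean-pooling reference encoder, the element-wise addition used for conditioning, and the standard (non-saturating) GAN objective are all hypothetical stand-ins, since the abstract does not specify the architecture or the loss.

```python
import math
import random

EMB_DIM = 4  # hypothetical prosody-embedding size, chosen only for this toy


def reference_encoder(mel_frames):
    """Toy stand-in: mean-pool mel frames into one reference prosody embedding."""
    n = len(mel_frames)
    return [sum(frame[d] for frame in mel_frames) / n for d in range(len(mel_frames[0]))]


def prosody_generator(text_hidden, noise, weight):
    """Toy stand-in: estimate a prosody embedding from text features plus noise."""
    return [weight * h + z for h, z in zip(text_hidden, noise)]


def discriminator(embedding, weights):
    """Toy stand-in: logistic score; near 1 means 'looks like a reference embedding'."""
    logit = sum(w * e for w, e in zip(weights, embedding))
    return 1.0 / (1.0 + math.exp(-logit))


def gan_losses(d_real, d_fake, eps=1e-8):
    """Standard GAN losses (assumed here; the abstract does not name the objective)."""
    d_loss = -(math.log(d_real + eps) + math.log(1.0 - d_fake + eps))
    g_loss = -math.log(d_fake + eps)  # non-saturating generator loss
    return d_loss, g_loss


def add_prosody(text_hidden, prosody_embedding):
    """Condition the latent representation by element-wise addition (an assumption)."""
    return [h + p for h, p in zip(text_hidden, prosody_embedding)]


rng = random.Random(0)
text_hidden = [rng.gauss(0, 1) for _ in range(EMB_DIM)]
mel_frames = [[rng.gauss(0, 1) for _ in range(EMB_DIM)] for _ in range(10)]
noise = [rng.gauss(0, 1) for _ in range(EMB_DIM)]
d_weights = [0.5] * EMB_DIM

# Training: the reference embedding is extracted from the mel-spectrogram,
# and the generator is trained adversarially to produce similar embeddings from text.
ref_emb = reference_encoder(mel_frames)
est_emb = prosody_generator(text_hidden, noise, weight=0.1)
d_loss, g_loss = gan_losses(discriminator(ref_emb, d_weights),
                            discriminator(est_emb, d_weights))
train_latent = add_prosody(text_hidden, ref_emb)

# Inference: no mel-spectrogram is available, so the generator's estimate is used.
infer_latent = add_prosody(text_hidden, est_emb)
```

The point of the sketch is only the data flow: the same conditioning step receives a mel-derived embedding during training and a text-derived estimate during inference, with the discriminator pushing the two distributions together.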

Keywords:

Prosody; Spectrogram; Computer science; Generative grammar; Inference; Speech recognition; Speech synthesis; Feature (linguistics); Process (computing); Artificial intelligence; Generative model; End-to-end principle; Acoustic model; Adversarial system; Natural language processing; Speech processing; Linguistics

Topics

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Related Documents

Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Adversarial Learning For End-To-End Cochlear Speech Denoising Using Lightweight Deep Learning Models

Lightweight End-to-End Speech Enhancement Generative Adversarial Network Using Sinc Convolutions

End-to-End Text-to-Speech for Minangkabau Pariaman Dialect Using Variational Autoencoder with Adversarial Learning (VITS)