JOURNAL ARTICLE

Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech

Abstract

To simplify the generation process, several text-to-speech (TTS) systems implicitly learn intermediate latent representations instead of relying on predefined features (e.g., mel-spectrograms). However, their generation quality is unsatisfactory because these representations lack speech variance information. In this paper, we improve TTS performance by adding prosody embeddings to the latent representations. During training, we extract reference prosody embeddings from mel-spectrograms; during inference, we estimate these embeddings from text using generative adversarial networks (GANs). GANs let us estimate the prosody embeddings reliably and quickly, even though their distributions are complex owing to the dynamic nature of speech. We also show that the prosody embeddings serve as efficient features for learning a robust alignment between text and acoustic features. In comparative experiments, our proposed model surpasses several publicly available models while using fewer parameters and less computation.
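
The train/inference split described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the mean-pooling stand-in for the reference encoder, the linear stand-in for the generator, and the least-squares GAN losses (the abstract does not say which adversarial objective is used).

```python
import random

def reference_prosody(mel_frames):
    """Hypothetical reference encoder: average each mel bin over time."""
    n_frames = len(mel_frames)
    return [sum(column) / n_frames for column in zip(*mel_frames)]

def predict_prosody(text_hidden, noise, weights):
    """Hypothetical generator: linear map of text features plus noise."""
    return [sum(w * h for w, h in zip(row, text_hidden)) + z
            for row, z in zip(weights, noise)]

def select_prosody(mode, mel_frames, text_hidden, weights, rng):
    """Training uses the reference embedding; inference uses the estimate."""
    if mode == "train":
        return reference_prosody(mel_frames)
    noise = [rng.gauss(0.0, 1.0) for _ in weights]
    return predict_prosody(text_hidden, noise, weights)

def condition_latent(latent, prosody):
    """Add the prosody embedding to the latent representation element-wise."""
    return [l + p for l, p in zip(latent, prosody)]

def discriminator_loss(scores_real, scores_fake):
    """Least-squares GAN loss (assumed): real scores toward 1, fake toward 0."""
    real = sum((s - 1.0) ** 2 for s in scores_real) / len(scores_real)
    fake = sum(s ** 2 for s in scores_fake) / len(scores_fake)
    return real + fake

def generator_loss(scores_fake):
    """Least-squares GAN loss (assumed): push fake scores toward 1."""
    return sum((s - 1.0) ** 2 for s in scores_fake) / len(scores_fake)

if __name__ == "__main__":
    rng = random.Random(0)
    mel = [[1.0, 2.0], [3.0, 4.0]]      # two frames, two mel bins (toy data)
    weights = [[1.0, 2.0], [0.0, 1.0]]  # toy generator weights
    train_emb = select_prosody("train", mel, [1.0, 1.0], weights, rng)
    infer_emb = select_prosody("infer", mel, [1.0, 1.0], weights, rng)
    print(condition_latent([0.5, 0.5], train_emb))
    print(condition_latent([0.5, 0.5], infer_emb))
```

In this sketch the discriminator would score reference embeddings as real and text-predicted embeddings as fake, so that after training the generator alone can supply the embedding when no mel-spectrogram is available.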

Keywords:
Prosody, Spectrogram, Computer science, Generative grammar, Inference, Speech recognition, Speech synthesis, Feature (linguistics), Process (computing), Artificial intelligence, Generative model, End-to-end principle, Acoustic model, Adversarial system, Natural language processing, Speech processing, Linguistics

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 23
Citation Normalized Percentile: 0.10

Topics

Speech Recognition and Synthesis
Physical Sciences →  Computer Science →  Artificial Intelligence
Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
