In recent years, deep learning-based end-to-end Text-To-Speech (TTS) models have made significant progress in enhancing speech naturalness and fluency. However, existing Variational Inference with adversarial learning for end-to-end Text-to-Speech (VITS) models still face challenges such as insufficient pitch modeling, inadequate capture of contextual dependencies, and low inference efficiency in the decoder. To address these issues, this paper proposes an improved TTS framework named Q-VITS. Q-VITS incorporates Rotary Position Embedding (RoPE) into the text encoder to enhance long-sequence modeling, adopts a frame-level prior modeling strategy to better handle one-to-many mappings, and designs a diffusion-model-based style extractor for controllable style rendering. Additionally, the proposed decoder, ConfoGAN, integrates explicit F0 modeling, Pseudo-Quadrature Mirror Filter (PQMF) multi-band synthesis, and a Conformer structure. Experimental results on a single-speaker dataset demonstrate that Q-VITS outperforms VITS in speech quality, pitch accuracy, and inference efficiency under both subjective Mean Opinion Score (MOS) and objective Mel-Cepstral Distortion (MCD) and Root Mean Square Error (RMSE) evaluations, achieving performance close to that of ground-truth audio. These improvements provide an effective solution for efficient and controllable speech synthesis.
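The abstract credits RoPE with improving long-sequence modeling in the text encoder. As background, a minimal NumPy sketch of the standard RoPE operation is shown below; the function name and shapes are illustrative assumptions, not the paper's implementation, and in Q-VITS the rotation would be applied to the query/key projections inside the encoder's attention layers.

```python
import numpy as np

def rotary_position_embedding(x):
    """Apply Rotary Position Embedding (RoPE) to a sequence of vectors.

    x: array of shape (seq_len, dim), with dim even.
    Each consecutive feature pair is rotated by an angle that grows with
    position, so that dot products between rotated queries and keys
    depend on relative position. (Illustrative sketch, not Q-VITS code.)
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0, "feature dimension must be even"
    # Frequencies as in the original RoPE formulation: theta_i = 10000^(-2i/dim)
    inv_freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    angles = pos * inv_freq[None, :]           # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]            # split features into pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin         # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is rotated by an orthogonal 2-D rotation, vector norms are preserved and position 0 is left unchanged, which is what makes the encoding compatible with pre-trained attention weights.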