JOURNAL ARTICLE

Period VITS: Variational Inference with Explicit Pitch Modeling for End-To-End Emotional Speech Synthesis

Abstract

Several fully end-to-end text-to-speech (TTS) models have been proposed that show better performance than cascade models (i.e., training acoustic and vocoder models separately). However, they often generate unstable pitch contours with audible artifacts when the dataset contains emotional attributes, i.e., a large diversity of pronunciation and prosody. To address this problem, we propose Period VITS, a novel end-to-end TTS model that incorporates an explicit periodicity generator. In the proposed method, we introduce a frame pitch predictor that predicts prosodic features, such as pitch and voicing flags, from the input text. From these features, the proposed periodicity generator produces a sample-level sinusoidal source that enables the waveform decoder to accurately reproduce the pitch. Finally, the entire model is jointly optimized in an end-to-end manner with variational inference and adversarial objectives. As a result, the decoder becomes capable of generating more stable, expressive, and natural output waveforms. The experimental results showed that the proposed model significantly outperforms baseline models in terms of naturalness, with improved pitch stability in the generated samples.
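The abstract's periodicity generator turns frame-level pitch and voicing flags into a sample-level sinusoidal excitation. A minimal sketch of that idea (not the authors' implementation; `hop_length`, `sample_rate`, and `noise_std` are illustrative assumptions) might look like:

```python
import numpy as np

def sinusoidal_source(f0_frames, voiced_flags, hop_length=256,
                      sample_rate=22050, noise_std=0.003):
    """Upsample frame-level F0 and voicing flags to sample resolution and
    synthesize a sinusoidal excitation; unvoiced regions fall back to noise.
    This is a generic source-model sketch, not the Period VITS code."""
    n_samples = len(f0_frames) * hop_length
    frame_pos = np.arange(len(f0_frames))
    sample_pos = np.arange(n_samples) / hop_length
    # Linear interpolation gives a smooth per-sample F0 trajectory
    f0 = np.interp(sample_pos, frame_pos, np.asarray(f0_frames, dtype=float))
    voiced = np.interp(sample_pos, frame_pos,
                       np.asarray(voiced_flags, dtype=float)) > 0.5
    # Integrating instantaneous frequency keeps the phase continuous
    # even as F0 varies, avoiding clicks at frame boundaries
    phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)
    sine = np.sin(phase)
    noise = np.random.randn(n_samples) * noise_std
    return np.where(voiced, sine + noise, noise)
```

Feeding such an explicit periodic source to the waveform decoder is what lets it reproduce pitch faithfully instead of inferring periodicity from scratch.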

Keywords:
Computer science, Naturalness, Speech recognition, Prosody, Speech synthesis, Waveform, Cascade, Generator (circuit theory), Jitter, Mean opinion score, End-to-end principle, Voice, Artificial intelligence, Power (physics)

Metrics

Cited by: 11
FWCI (Field-Weighted Citation Impact): 2.81
References: 40
Citation Normalized Percentile: 0.89

Topics

Speech Recognition and Synthesis
Physical Sciences →  Computer Science →  Artificial Intelligence
Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing

Related Documents

JOURNAL ARTICLE

EVASS: Emotional Variational End-to-End Speech Synthesis with Semi-Supervised and Adversarial Learning

Mohamed Osman

Journal: 2022 2nd International Mobile, Intelligent, and Ubiquitous Computing Conference (MIUCC), Year: 2022, Vol: 30, Pages: 97-103
JOURNAL ARTICLE

Fast Inference End-to-End Speech Synthesis with Style Diffusion

Hui Sun, J Q Song, Yi Jiang

Journal: Electronics, Year: 2025, Vol: 14 (14), Pages: 2829-2829
BOOK-CHAPTER

PiCo-VITS: Leveraging Pitch Contours for Fine-Grained Emotional Speech Synthesis

Kwan-yeung Wong, Fu-Lai Chung

Lecture Notes in Computer Science, Year: 2024, Pages: 210-221