EmoQ-TTS: Emotion Intensity Quantization for Fine-Grained Controllable Emotional Text-to-Speech

Chae-Bin Im; Sang-Hoon Lee; Seung-Bin Kim; Seong‐Whan Lee

doi:10.1109/icassp43922.2022.9747098

Year: 2022
Journal: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Pages: 6317-6321

Abstract

Although recent advances in text-to-speech (TTS) have brought significant improvements, emotional speech synthesis remains limited. To produce emotional speech, most works use emotion information extracted from emotion labels or reference audio. However, because these emotion conditions are applied at the utterance level, the resulting emotional expression is monotonous. In this paper, we propose EmoQ-TTS, which synthesizes expressive emotional speech by conditioning on phoneme-wise emotion information with fine-grained emotion intensity. The intensity of the emotion information is derived by distance-based intensity quantization, without human labeling. The emotional expression of the synthesized speech can also be controlled by setting the intensity labels manually. Experimental results demonstrate the superiority of EmoQ-TTS in emotional expressiveness and controllability.
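The abstract does not spell out how distance-based intensity quantization is computed, so the following is only a minimal sketch of one possible reading. It assumes each phoneme has an emotion embedding, measures its Euclidean distance from the centroid of neutral embeddings, and bins the min-max scaled distance into a fixed number of intensity levels. The neutral-centroid reference, the uniform binning, and the names `centroid` and `quantize_intensity` are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def quantize_intensity(
    embeddings: Sequence[Sequence[float]],
    neutral_centroid: Sequence[float],
    num_levels: int = 3,
) -> List[int]:
    """Assign each embedding an integer intensity label in [0, num_levels - 1].

    Assumption for illustration: intensity grows with Euclidean distance
    from the neutral centroid, and levels are uniform bins over the
    min-max scaled distances. No human intensity labels are used.
    """
    distances = [math.dist(e, neutral_centroid) for e in embeddings]
    lo, hi = min(distances), max(distances)
    if hi == lo:
        # All embeddings are equally far from neutral: a single level.
        return [0] * len(distances)
    labels = []
    for d in distances:
        scaled = (d - lo) / (hi - lo)  # in [0, 1]
        # Clamp so the largest distance falls into the top bin.
        labels.append(min(int(scaled * num_levels), num_levels - 1))
    return labels


# Toy data: four neutral embeddings and four phoneme-level embeddings.
neutral = [[0.5, 0.0], [-0.5, 0.0], [0.0, 0.5], [0.0, -0.5]]
phoneme_embeddings = [[1.0, 0.0], [0.0, 1.5], [3.0, 0.0], [3.0, 4.0]]

pseudo_labels = quantize_intensity(phoneme_embeddings, centroid(neutral))
print(pseudo_labels)
```

Under the same assumptions, manual control would amount to replacing `pseudo_labels` with hand-chosen integers in the same range before conditioning the synthesizer.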

Keywords:

Utterance; Computer science; Speech recognition; Controllability; Quantization (signal processing); Speech synthesis; Emotional expression; Artificial intelligence; Psychology; Cognitive psychology; Mathematics; Computer vision

Metrics

FWCI (Field Weighted Citation Impact): 4.58

Citation Normalized Percentile: 0.95

Topics

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Related Documents

EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech

Controllable Accented Text-to-Speech Synthesis With Fine and Coarse-Grained Intensity Rendering

Leveraging Speech PTM, Text LLM, And Emotional TTS For Speech Emotion Recognition

VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech

Fine-Grained Emotional Control of Text-to-Speech: Learning to Rank Inter- and Intra-Class Emotion Intensities