Semi-Supervised Speaker Adaptation for End-to-End Speech Synthesis with Pretrained Models

Katsuki Inoue; Sunao Hara; Masanobu Abe; Tomoki Hayashi; Ryuichi Yamamoto; Shinji Watanabe

doi:10.1109/icassp40776.2020.9053371

JOURNAL ARTICLE

Year: 2020 Pages: 7634-7638

Abstract

Recently, end-to-end text-to-speech (TTS) models have achieved remarkable performance, but they require a large amount of paired text and speech data for training. In contrast, dozens of minutes of unpaired speech recordings for a target speaker, without corresponding text, are easy to collect. To make use of such accessible data, the proposed method leverages the recent success of state-of-the-art end-to-end automatic speech recognition (ASR) systems and obtains the corresponding transcriptions from pretrained ASR models. Although these models provide only text output rather than intermediate linguistic features such as phonemes, end-to-end TTS can be trained well on such raw text directly. The proposed method can therefore greatly simplify the speaker adaptation pipeline by consistently employing end-to-end ASR/TTS ecosystems. Experimental results show that the proposed method achieves performance comparable to a paired-data adaptation method in terms of subjective speaker similarity and objective cepstral distance measures.
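The data-preparation step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `build_pseudo_paired_data` and `dummy_asr` are hypothetical, the pretrained ASR model is replaced by a stand-in function so the sketch runs on its own, and the rule that drops empty transcriptions is an added assumption rather than something the abstract states.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical type: any callable that maps an audio file path to raw text
# can play the role of the pretrained ASR model.
Transcriber = Callable[[str], str]


def build_pseudo_paired_data(
    audio_paths: Iterable[str],
    transcribe: Transcriber,
) -> List[Tuple[str, str]]:
    """Pair each untranscribed recording with the raw text ASR produces for it."""
    pairs = []
    for path in audio_paths:
        text = transcribe(path).strip()
        # Assumption (not from the abstract): skip recordings with empty ASR output.
        if text:
            pairs.append((text, path))
    return pairs


def dummy_asr(path: str) -> str:
    """Stand-in for a pretrained ASR model, used only to make the sketch runnable."""
    fake_outputs = {
        "utt1.wav": "hello world",
        "utt2.wav": "  ",
        "utt3.wav": "good morning",
    }
    return fake_outputs.get(path, "")


# Unpaired target-speaker recordings become (text, audio) pairs for TTS adaptation.
adaptation_set = build_pseudo_paired_data(
    ["utt1.wav", "utt2.wav", "utt3.wav"], dummy_asr
)
```

In a real setup, the resulting pairs would be used to fine-tune a pretrained end-to-end TTS model on raw text; that step depends on the toolkit in use and is omitted here.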

Keywords:

Computer science; Speech recognition; End-to-end principle; Pipeline (software); Adaptation (eye); Mel-frequency cepstrum; Artificial intelligence; Speaker recognition; Similarity (geometry); Feature extraction

Metrics

FWCI (Field Weighted Citation Impact): 1.76

Citation Normalized Percentile: 0.87

Topics

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Semi-Supervised End-to-End Speech Recognition

Semi-Supervised Learning Based on Hierarchical Generative Models for End-to-End Speech Synthesis

Enhancing End-to-End Speech Synthesis by Modeling Interrogative Sentences with Speaker Adaptation

Modeling Irregular Voice in End-to-End Speech Synthesis via Speaker Adaptation

Semi-supervised domain adaptation using unlabeled data for end-to-end speech recognition