JOURNAL ARTICLE

A comparison of supervised and unsupervised cross-lingual speaker adaptation approaches for HMM-based speech synthesis

Abstract

The EMIME project aims to build a personalized speech-to-speech translator, so that a user's spoken input in one language is used to produce spoken output in another language that still sounds like the user's voice. This goal makes unsupervised cross-lingual speaker adaptation one of the keys to the project's success. So far, the unsupervised and cross-lingual cases have been studied separately, by means of decision tree marginalization and HMM state mapping respectively. In this paper we combine the two techniques to perform unsupervised cross-lingual speaker adaptation. The performance of eight speaker adaptation systems (supervised vs. unsupervised, intra-lingual vs. cross-lingual) is compared using objective and subjective evaluations. Experimental results show that the performance of unsupervised cross-lingual speaker adaptation is comparable to that of the supervised case in terms of spectrum adaptation in the EMIME scenario, even though the automatically obtained transcriptions have a very high phoneme error rate.
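The abstract names HMM state mapping as the cross-lingual component: each state of the output-language voice model is associated with a state of the input-language model, typically by minimum Kullback-Leibler divergence (KLD) between the state output distributions. The sketch below illustrates that idea for states modeled as single diagonal-Gaussian distributions; the function names, data layout, and toy values are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def gauss_kld(mean_p, var_p, mean_q, var_q):
    """KL divergence KL(p || q) between two diagonal Gaussians."""
    return 0.5 * np.sum(
        np.log(var_q / var_p)
        + (var_p + (mean_p - mean_q) ** 2) / var_q
        - 1.0
    )

def build_state_mapping(states_out, states_in):
    """Map each output-language HMM state to the input-language state
    with the smallest symmetric KLD between their Gaussians."""
    mapping = {}
    for name_o, (m_o, v_o) in states_out.items():
        best_name, best_d = None, np.inf
        for name_i, (m_i, v_i) in states_in.items():
            d = (gauss_kld(m_o, v_o, m_i, v_i)
                 + gauss_kld(m_i, v_i, m_o, v_o))
            if d < best_d:
                best_name, best_d = name_i, d
        mapping[name_o] = best_name
    return mapping

# Toy example: two states per language, 2-dimensional features.
states_out = {
    "out_s1": (np.zeros(2), np.ones(2)),
    "out_s2": (np.full(2, 5.0), np.ones(2)),
}
states_in = {
    "in_t1": (np.full(2, 4.8), np.ones(2)),
    "in_t2": (np.full(2, 0.1), np.ones(2)),
}
mapping = build_state_mapping(states_out, states_in)
print(mapping)  # → {'out_s1': 'in_t2', 'out_s2': 'in_t1'}
```

Once such a mapping exists, adaptation transforms estimated on the input-language states can be applied to the mapped output-language states, which is what allows a voice built in one language to be adapted with speech from another.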

Keywords:
Hidden Markov model, Speech recognition, Speaker adaptation, Speaker diarisation, Artificial intelligence, Natural language processing, Word error rate, Decision tree, Unsupervised learning, Speaker recognition

Metrics

Cited by: 30
FWCI (Field-Weighted Citation Impact): 7.21
References: 8
Citation Normalized Percentile: 0.97 (in top 1% and top 10%)

Topics

Speech Recognition and Synthesis
Physical Sciences → Computer Science → Artificial Intelligence
Speech and Audio Processing
Physical Sciences → Computer Science → Signal Processing
Speech and dialogue systems
Physical Sciences → Computer Science → Artificial Intelligence