JOURNAL ARTICLE

Automatically learning speaker-independent acoustic subword units

Abstract

We investigate methods for unsupervised learning of subword acoustic units of a language directly from speech. We demonstrate that the states of a hidden Markov model "grown" using a novel modification of the maximum likelihood successive state splitting algorithm correspond very well with the phones of the language. In particular, the correspondence between the Viterbi state sequence for unseen speech from the training speaker and the phone transcription of that speech is over 85%, and generalizes to a large extent (∼63%) to speech from a different speaker. Furthermore, we are able to bridge more than half the gap between the speaker-dependent and cross-speaker correspondence of the automatically learned units to phones (∼75% accuracy) by unsupervised adaptation via MLLR.
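The mapping from speech to learned units described above rests on Viterbi decoding: finding the most likely HMM state sequence for an observation sequence, which is then compared against the phone transcription. As a minimal illustration (not the paper's actual continuous-density model or state-splitting procedure), the following sketch runs Viterbi decoding on a toy discrete two-state HMM; all parameter values are hypothetical.

```python
import math

def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most likely state sequence for a discrete-observation HMM,
    computed by log-domain dynamic programming."""
    n_states = len(start_p)
    # delta[t][s]: log-probability of the best path ending in state s at time t
    delta = [[math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
              for s in range(n_states)]]
    back = []  # backpointers for recovering the best path
    for t in range(1, len(obs)):
        row, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: delta[-1][p] + math.log(trans_p[p][s]))
            row.append(delta[-1][best_prev] + math.log(trans_p[best_prev][s])
                       + math.log(emit_p[s][obs[t]]))
            ptr.append(best_prev)
        delta.append(row)
        back.append(ptr)
    # Trace back from the best final state
    state = max(range(n_states), key=lambda s: delta[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

# Toy 2-state HMM over 2 observation symbols (numbers purely illustrative)
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit  = [[0.9, 0.1], [0.2, 0.8]]
print(viterbi([0, 0, 1], start, trans, emit))  # → [0, 0, 1]
```

In the paper's setting the states are the automatically learned units rather than known phones, and the decoded state sequence is what gets scored against the reference phone transcription.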

Keywords:
Speech recognition, Hidden Markov model, Viterbi algorithm, Unsupervised learning, Acoustic model, Subword units, Phone, Transcription (linguistics), Speech processing, Natural language processing

Metrics

Cited By: 7
FWCI (Field Weighted Citation Impact): 0.00
Refs: 11
Citation Normalized Percentile: 0.01

Topics

Speech Recognition and Synthesis
Physical Sciences →  Computer Science →  Artificial Intelligence
Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Speech and dialogue systems
Physical Sciences →  Computer Science →  Artificial Intelligence