We investigate the use of multi-stream HMMs in the automatic recognition of audio-visual speech. Multi-stream HMMs allow the modeling of asynchrony between the audio and visual state sequences at a variety of levels (phone, syllable, word, etc.) and are equivalent to product, or composite, HMMs. In this paper, we consider such models synchronized at the phone boundary level, allowing various degrees of audio and visual state-sequence asynchrony. Furthermore, we investigate joint training of all product HMM parameters, instead of just composing the model from separately trained audio-only and visual-only HMMs. We report experiments on a multi-subject connected digit recognition task, as well as on a more complex, speaker-independent large-vocabulary dictation task. Our results demonstrate that in both cases, joint multi-stream HMM training is superior to separate training of single-stream HMMs. In addition, we observe that allowing state-sequence asynchrony between the HMM audio and visual components improves connected digit recognition significantly, but degrades performance on the dictation task. The resulting multi-stream models dramatically improve speech recognition robustness to noise by successfully exploiting the speech information in the visual modality: for example, at 11 dB SNR, they reduce the connected digit word error rate from 2.3% audio-only to 0.77% audio-visual, and, for the large-vocabulary task, from 28.3% to 19.5%. Compared to the audio-only performance at 10 dB SNR, the use of multi-stream HMMs achieves an effective SNR gain of up to 9 dB and 7 dB, respectively, for the two recognition tasks considered.
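To make the multi-stream emission model concrete, the following is a minimal sketch of the standard multi-stream HMM state score, in which per-stream log-likelihoods are combined with stream exponents (weights). All function and parameter names here are illustrative, and the single-Gaussian streams are a simplifying assumption, not the paper's actual front end:

```python
import math

def gaussian_logpdf(x, mean, var):
    # Log-density of a 1-D Gaussian; stands in for a per-stream
    # state emission model (real systems use mixture densities).
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def multistream_log_likelihood(obs_audio, obs_visual,
                               audio_mean, audio_var,
                               visual_mean, visual_var,
                               lambda_audio=0.7, lambda_visual=0.3):
    # Multi-stream state score: stream log-likelihoods weighted by
    # exponents lambda_audio and lambda_visual (typically summing to 1),
    # which control the relative reliability of each modality.
    log_b_a = gaussian_logpdf(obs_audio, audio_mean, audio_var)
    log_b_v = gaussian_logpdf(obs_visual, visual_mean, visual_var)
    return lambda_audio * log_b_a + lambda_visual * log_b_v
```

With equal weights and identical stream models, the combined score reduces to the single-stream score; lowering the audio weight at low SNR is the usual mechanism by which the visual stream improves noise robustness.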