We investigate the use of multi-stream HMMs in the automatic recognition of audio-visual speech. Multi-stream HMMs allow the modeling of asynchrony between the audio and visual state sequences at a variety of levels (phone, syllable, word, etc.) and are equivalent to product, or composite, HMMs. In this paper, we consider such models synchronized at the phone boundary level, allowing various degrees of audio and visual state-sequence asynchrony. Furthermore, we investigate joint training of all product HMM parameters, instead of just composing the model from separately trained audio-only and visual-only HMMs. We report experiments on a multi-subject connected digit recognition task, as well as on a more complex, speaker-independent large-vocabulary dictation task. Our results demonstrate that in both cases, joint multi-stream HMM training is superior to separate training of single-stream HMMs. In addition, we observe that allowing state-sequence asynchrony between the HMM audio and visual components improves connected digit recognition significantly, but degrades performance on the dictation task. The resulting multi-stream models dramatically improve speech recognition robustness to noise by successfully exploiting the speech information in the visual modality: for example, at 11 dB SNR, they reduce the connected digit word error rate from 2.3% audio-only to 0.77% audio-visual, and, for the large-vocabulary task, from 28.3% to 19.5%. Compared to the audio-only performance at 10 dB SNR, the use of multi-stream HMMs achieves an effective SNR gain of up to 9 dB and 7 dB, respectively, for the two recognition tasks considered.
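To make the multi-stream emission model concrete, the following is a minimal sketch of the standard multi-stream HMM state score, in which per-stream log-likelihoods are combined with stream exponents (weights). All function and parameter names here are illustrative, and the single-Gaussian streams are a simplifying assumption, not the paper's actual front end:

```python
import math

def gaussian_logpdf(x, mean, var):
    # Log-density of a 1-D Gaussian; stands in for a per-stream
    # state emission model (real systems use mixture densities).
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def multistream_log_likelihood(obs_audio, obs_visual,
                               audio_mean, audio_var,
                               visual_mean, visual_var,
                               lambda_audio=0.7, lambda_visual=0.3):
    # Multi-stream state score: stream log-likelihoods weighted by
    # exponents lambda_audio and lambda_visual (typically summing to 1),
    # which control the relative reliability of each modality.
    log_b_a = gaussian_logpdf(obs_audio, audio_mean, audio_var)
    log_b_v = gaussian_logpdf(obs_visual, visual_mean, visual_var)
    return lambda_audio * log_b_a + lambda_visual * log_b_v
```

With equal weights and identical stream models, the combined score reduces to the single-stream score; lowering the audio weight at low SNR is the usual mechanism by which the visual stream improves noise robustness.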