In this paper, two multi-stream asynchrony Dynamic Bayesian Network (DBN) models, the MS-ADBN model and the MM-ADBN model, are proposed for small-vocabulary and large-vocabulary audio-visual speech recognition. Both relax the asynchrony constraint between the audio stream and the visual stream to the word level. Essentially, the MS-ADBN model is a word model with a word-phone-observation topology, whose basic recognition units are words, while the MM-ADBN model is a phone model with a word-phone-state-observation topology, whose basic recognition units are phones. Speech recognition experiments are carried out on a digit audio-visual database and a continuous audio-visual database. The results show that the MS-ADBN model has the highest recognition rate on the digit audio-visual database, while on the continuous audio-visual database, in a clean speech environment, the MM-ADBN model improves the speech recognition rate by 35.91% and 9.97% over the SA-MSHMM and the MS-ADBN model, respectively. In the future work, we
Guoyun Lv, Dongmei Jiang, Rongchun Zhao, Yunshu Hou