HMM-based audio-visual speech recognition (AVSR) systems have shown success in continuous speech recognition by combining visual and audio information, especially in noisy environments. In this paper we study how to improve the decision trees used to create context classes in HMM-based AVSR systems. Traditionally, visual models have been trained with the same context classes as the audio-only models. Here we investigate the use of separate decision trees to model the context classes for the audio and visual streams independently, and we further investigate the use of viseme classes in decision-tree building for the visual stream. In experiments on a 37-speaker, 1.5-hour test set (about 12,000 words) of continuous digits in noise, we obtain about a 3% absolute (20% relative) gain in AVSR performance by using separate decision trees for the audio and visual streams when viseme classes are used in decision-tree building for the visual stream.
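The paper itself gives no implementation details here, but the core mechanism it refers to, likelihood-based decision-tree state clustering driven by question sets, can be sketched generically. The snippet below is a minimal, hypothetical illustration (not the authors' system): context-dependent states are split greedily by the context-phone question that maximizes the single-Gaussian log-likelihood gain, and the question set consists of viseme-style articulatory classes, as one might use for the visual stream. The viseme groupings and all names are illustrative assumptions.

```python
import math

# Hypothetical viseme-class questions (illustrative groupings, not from the paper):
# each question asks whether the context phone belongs to a visually similar set.
VISEME_QUESTIONS = {
    "bilabial": {"p", "b", "m"},
    "labiodental": {"f", "v"},
    "rounded": {"w", "uw", "ao"},
}

def pooled_loglik(samples):
    """Log-likelihood of 1-D samples under a single ML-estimated Gaussian."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    var = max(var, 1e-6)  # floor the variance for numerical safety
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def best_split(states, questions):
    """Pick the question giving the largest log-likelihood gain.

    states: list of (context_phone, feature_samples) pairs pooled at a node.
    Returns (question_name, gain) or None if no question separates the data.
    """
    parent = pooled_loglik([x for _, s in states for x in s])
    best = None
    for name, phone_set in questions.items():
        yes = [x for p, s in states if p in phone_set for x in s]
        no = [x for p, s in states if p not in phone_set for x in s]
        if not yes or not no:
            continue  # question does not split this node
        gain = pooled_loglik(yes) + pooled_loglik(no) - parent
        if best is None or gain > best[1]:
            best = (name, gain)
    return best

# Toy data: states whose left context is bilabial have shifted features,
# so the bilabial question should win the split.
states = [
    ("p", [1.0, 1.1, 0.9]),
    ("b", [1.2, 1.0]),
    ("s", [-1.0, -0.9]),
    ("t", [-1.1, -1.0]),
    ("w", [0.0, 0.1]),
]
print(best_split(states, VISEME_QUESTIONS)[0])  # -> bilabial
```

Using separate trees per stream then simply means running this clustering twice: once on audio features with phonetic questions, and once on visual features with viseme questions, so each stream's context classes reflect its own confusability structure.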
Jing Huang, Karthik Visweswariah