Visual information has been shown to improve the performance of speech recognition systems in noisy acoustic environments. However, most audio-visual speech recognizers rely on a clean visual signal. In this paper, we explore a novel approach to visual speech modeling, based on articulatory features, which has potential benefits under visually challenging conditions. The idea is to use a set of parallel classifiers to extract different articulatory attributes from the input images, and then combine their decisions to obtain higher-level units, such as visemes or words. We evaluate our approach in a preliminary experiment on a small audio-visual database, using several image noise conditions, and compare it to the standard viseme-based modeling approach.
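The decision-fusion scheme described above can be sketched in a few lines. This is a minimal illustration under assumed names: the articulatory attributes, their values, the viseme inventory, and the independence-based multiplication of classifier posteriors are all hypothetical choices for exposition, not the paper's exact configuration.

```python
# Sketch: parallel articulatory-feature classifiers each output a posterior
# over their attribute's values; a viseme is scored by multiplying the
# posteriors of the feature values that define it (independence assumption),
# and the highest-scoring viseme is chosen.

# Hypothetical attribute posteriors for one video frame
# (each attribute's values sum to 1).
feature_posteriors = {
    "lip_opening":  {"closed": 0.7, "narrow": 0.2, "wide": 0.1},
    "lip_rounding": {"rounded": 0.1, "unrounded": 0.9},
    "labio_dental": {"yes": 0.05, "no": 0.95},
}

# Hypothetical viseme definitions as articulatory-feature combinations.
viseme_definitions = {
    "p/b/m": {"lip_opening": "closed", "lip_rounding": "unrounded", "labio_dental": "no"},
    "f/v":   {"lip_opening": "narrow", "lip_rounding": "unrounded", "labio_dental": "yes"},
    "o/u":   {"lip_opening": "narrow", "lip_rounding": "rounded",   "labio_dental": "no"},
}

def score_viseme(features):
    """Combine the independent feature posteriors by multiplication."""
    score = 1.0
    for attribute, value in features.items():
        score *= feature_posteriors[attribute][value]
    return score

scores = {v: score_viseme(f) for v, f in viseme_definitions.items()}
best = max(scores, key=scores.get)
print(best)  # -> p/b/m
```

One appeal of this factored scheme under visual noise is that each classifier degrades independently: an occluded or blurred frame that corrupts one attribute's posterior still leaves the others informative, whereas a monolithic viseme classifier has no such fallback.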