This article addresses the problem of continuous speech recognition from visual information only, without exploiting any audio signal. Our approach combines a video camera and an ultrasound imaging system to simultaneously monitor the speaker's lips and the movement of the tongue. We investigate the use of convolutional neural networks (CNNs) to extract visual features directly from the raw ultrasound and video images. We propose several architectures, including a multimodal CNN that jointly processes the two visual modalities. Combined with an HMM-GMM decoder, the CNN-based approach outperforms our previous baseline based on Principal Component Analysis. Importantly, the recognition accuracy is only 4% lower than that obtained when decoding the audio signal, which makes it a good candidate for a practical visual speech recognition system.
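The fusion idea described above — per-modality convolutional feature extraction followed by concatenation into a joint visual feature vector — can be illustrated with a toy numpy sketch. All sizes, kernel counts, and the fusion point are illustrative assumptions for exposition, not the paper's actual architecture; a real system would use trained filters and deeper networks.

```python
import numpy as np

def conv2d_relu(img, kernel):
    """Valid-mode 2D cross-correlation followed by ReLU (toy implementation)."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return np.maximum(out, 0.0)

def max_pool(x, s=2):
    """Non-overlapping s x s max pooling; trailing rows/cols are cropped."""
    H2, W2 = x.shape[0] // s, x.shape[1] // s
    return x[:H2 * s, :W2 * s].reshape(H2, s, W2, s).max(axis=(1, 3))

def extract_features(img, kernels):
    """One conv + pool stage per kernel, flattened into a feature vector."""
    maps = [max_pool(conv2d_relu(img, k)) for k in kernels]
    return np.concatenate([m.ravel() for m in maps])

rng = np.random.default_rng(0)
ultrasound = rng.random((32, 32))   # toy mid-sagittal tongue image
lips = rng.random((32, 32))         # toy lip ROI from the video camera
kernels_us = rng.standard_normal((4, 5, 5))   # hypothetical learned filters
kernels_lip = rng.standard_normal((4, 5, 5))

# Multimodal fusion: concatenate the two modality-specific feature vectors
# into one joint visual feature vector for the downstream decoder.
fused = np.concatenate([extract_features(ultrasound, kernels_us),
                        extract_features(lips, kernels_lip)])
print(fused.shape)  # (1568,): 4 kernels x 14x14 pooled maps x 2 modalities
```

In the paper's pipeline, a feature vector of this kind (produced by the CNN rather than random filters) would then be fed to the HMM-GMM decoder in place of the PCA-based features of the earlier baseline.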