Lei Xie, Rong-Chun Zhao, Zhi-Qiang Liu
This paper proposes an adaptive stream reliability modeling technique for audio-visual speech recognition (AVSR). Because recognition conditions vary locally, we introduce two local measures, frame dispersion and window dispersion, to capture the temporal discriminative power and noise level of both the audio and visual streams. The dispersions are then mapped to stream exponents according to the minimum classification error (MCE) criterion. Experiments on a connected-digits task show that our method consistently outperforms the popular discriminative training (DT) and grid search (GS) methods at various signal-to-noise ratios (SNRs), improving, for example, the word accuracy rate (WAR) from 94.7% to 96.4% at 28 dB SNR.
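The abstract above describes a multistream scheme in which a per-frame dispersion measure drives the exponent that weights each stream's log-likelihood. The sketch below illustrates only that general idea, not the paper's actual algorithm: the dispersion formula (mean pairwise gap among the top-K class log-likelihoods) and the function names are illustrative assumptions, and the MCE-trained mapping from dispersion to exponent is not reproduced here.

```python
import numpy as np

def frame_dispersion(log_likelihoods, K=3):
    """Illustrative dispersion of one frame: mean pairwise difference
    among the top-K class log-likelihoods. A large value suggests the
    frame discriminates well between classes; a small value suggests
    an unreliable (e.g. noisy) frame. This formula is an assumption,
    not the paper's exact definition."""
    top = np.sort(np.asarray(log_likelihoods))[::-1][:K]
    diffs = [top[i] - top[j] for i in range(K) for j in range(i + 1, K)]
    return float(np.mean(diffs))

def combine_streams(audio_ll, video_ll, lam):
    """Standard multistream combination: weight the audio and visual
    log-likelihoods by exponents lam and (1 - lam). In the paper the
    exponent is obtained from the dispersions via MCE training; here
    lam is simply passed in."""
    return lam * audio_ll + (1.0 - lam) * video_ll
```

A discriminative frame (one dominant class) yields a higher dispersion than a frame whose top scores are nearly tied, so the audio exponent can be raised when the audio stream is locally reliable and lowered otherwise.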
Martin Heckmann, Frédéric Berthommier, Kristian Kroschel
Guoyun Lv, Dongmei Jiang, Rongchun Zhao, Yunshu Hou
Etienne Marcheret, Vit Libal, Gerasimos Potamianos
Guoyun Lv, Yangyu Fan, Dongmei Jiang, Rongchun Zhao