To achieve optimal multi-stream audio-visual speech recognition performance, appropriate dynamic weighting of each modality is desired. In this paper, we propose to estimate such weights based on a combination of acoustic signal space observations and single-modality audio and visual speech model likelihoods. Two modeling approaches are investigated for such weight estimation: one based on a sigmoid fitting function, the other employing Gaussian mixture models. Reported experiments demonstrate that the latter approach outperforms sigmoid-based modeling, and is dramatically superior to the static weighting scheme.
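As a rough illustration of the dynamic weighting idea described above, the following minimal sketch maps a scalar reliability indicator to an audio stream weight through a sigmoid and uses it to combine single-modality log-likelihoods. The function names, the `slope` and `offset` parameters, the convention that the two weights sum to one, and the per-frame values are all assumptions made for illustration; they are not taken from the paper, and the Gaussian mixture model variant is not shown.

```python
import math


def sigmoid_weight(reliability, slope=1.0, offset=0.0):
    """Map a scalar reliability indicator to an audio weight in (0, 1).

    `slope` and `offset` are hypothetical parameters that would be
    fitted on held-out data; the defaults are placeholders.
    """
    return 1.0 / (1.0 + math.exp(-(slope * reliability + offset)))


def fused_log_likelihood(log_p_audio, log_p_visual, audio_weight):
    """Combine single-stream log-likelihoods with weights that sum to 1."""
    return audio_weight * log_p_audio + (1.0 - audio_weight) * log_p_visual


# Hypothetical frames: (audio log-likelihood, visual log-likelihood,
# reliability indicator, e.g. derived from the acoustic signal).
frames = [(-2.0, -5.0, 3.0), (-6.0, -3.0, -3.0)]

# The weight changes frame by frame, unlike a static weighting scheme.
scores = [
    fused_log_likelihood(a, v, sigmoid_weight(r))
    for a, v, r in frames
]
```

In this sketch a high reliability indicator pushes the weight toward the audio stream, while a low one shifts it toward the visual stream; a static scheme would correspond to using one fixed `audio_weight` for every frame.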
Virginia Estellers, Mihai Gurban, Jean‐Philippe Thiran
Guoyun Lv, Dongmei Jiang, Rongchun Zhao, Yunshu Hou
Guoyun Lv, Yangyu Fan, Dongmei Jiang, Rongchun Zhao
Ali S. Saudi, Mahmoud I. Khalil, Hazem M. Abbas