Multi-modal fusion is an important yet challenging task for perceptual user interfaces. Humans routinely perform tasks, both simple and complex, in which ambiguous auditory and visual data are combined to support accurate perception. By contrast, automated approaches to processing multi-modal data sources lag far behind, primarily because few methods adequately model the complexity of the audio/visual relationship. We present an information-theoretic approach to the fusion of multiple modalities, and discuss a statistical model under which this approach to fusion is justified. We present empirical results demonstrating audio-video localization and consistency measurement, showing examples of determining where a speaker is within a scene and whether he or she is producing the specified audio stream.
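A minimal sketch of one way an information-theoretic audio-video consistency score could work (this is an illustrative assumption, not the authors' actual model): under a joint-Gaussian assumption, the mutual information between an audio feature and a visual feature reduces to a function of their correlation, and the image region maximizing that score can be taken as the speaker's location. All signals and region names below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
# Hypothetical per-frame audio energy envelope.
audio = rng.normal(size=T)

# Three candidate image-region feature tracks; region 1 is
# synthetically driven by the audio, the others are noise.
regions = [
    rng.normal(size=T),
    0.8 * audio + 0.6 * rng.normal(size=T),
    rng.normal(size=T),
]

def gaussian_mi(a, v):
    """Mutual information of two signals under a joint-Gaussian
    assumption: I(A;V) = -0.5 * log(1 - rho^2)."""
    rho = np.corrcoef(a, v)[0, 1]
    return -0.5 * np.log(1.0 - rho ** 2)

# Score each region against the audio; the argmax localizes the speaker.
scores = [gaussian_mi(audio, v) for v in regions]
speaker_region = int(np.argmax(scores))
```

The same score doubles as a consistency measure: when no region's mutual information with the audio rises above the noise floor, the audio stream is unlikely to have been produced by anyone visible in the scene.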
Giancarlo Iannizzotto, Francesco La Rosa, Carlo Costanzo, Pietro Lanzafame
Miguel Vallés, Maria Teresa Arredondo, Francisco del Pozo