JOURNAL ARTICLE

Audio-Visual Speaker Localization Using Graphical Models

Abstract

We propose an approach that uses graphical models to combine audio and video modalities for person tracking. We demonstrate a principled and intuitive framework for combining these modalities to obtain robustness against occlusion and changes in appearance. We further exploit the temporal correlations between adjacent frames of a moving object to handle cases where even both modalities together are not enough, e.g., when the person being tracked is occluded and not speaking. We show the improvement in tracking results at each step, evaluated against manually annotated ground truth.
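The abstract does not specify the model, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a 1-D grid of candidate positions, audio and video likelihoods that are conditionally independent given the position, and a first-order Markov motion model. All function names and the `stay_prob` parameter are hypothetical.

```python
def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total == 0:
        # No position is supported by the evidence: fall back to uniform.
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def predict(belief, stay_prob=0.6):
    """Temporal step: spread belief to neighbouring cells.

    Assumed motion model: the person stays put with probability
    stay_prob, otherwise moves one cell left or right (clamped at edges).
    """
    n = len(belief)
    move = (1.0 - stay_prob) / 2.0
    predicted = [0.0] * n
    for i, b in enumerate(belief):
        predicted[i] += stay_prob * b
        predicted[max(i - 1, 0)] += move * b
        predicted[min(i + 1, n - 1)] += move * b
    return predicted


def update(predicted, audio_lik=None, video_lik=None):
    """Fusion step: multiply in whichever likelihoods are available.

    A modality passed as None (silent speaker, occluded view) is skipped,
    so the estimate then rests on the other modality and the prediction.
    """
    posterior = list(predicted)
    for lik in (audio_lik, video_lik):
        if lik is not None:
            posterior = [p * l for p, l in zip(posterior, lik)]
    return normalize(posterior)


def track(observations, n_cells):
    """Return the most probable cell per frame.

    observations is a list of (audio_lik, video_lik) pairs, one per frame;
    either element may be None when that modality is unavailable.
    """
    belief = [1.0 / n_cells] * n_cells
    estimates = []
    for audio_lik, video_lik in observations:
        belief = update(predict(belief), audio_lik, video_lik)
        estimates.append(max(range(n_cells), key=belief.__getitem__))
    return estimates
```

When both likelihoods are `None` for a frame, the estimate is carried forward by the motion model alone, which corresponds to the occluded-and-silent case mentioned in the abstract.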

Keywords:
Computer science, Robustness (evolution), Modalities, Ground truth, Artificial intelligence, Computer vision, Video tracking, Exploit, Graphical model, Speech recognition, Object (grammar)

Metrics

Cited By: 7
FWCI (Field Weighted Citation Impact): 0.96
Refs: 9
Citation Normalized Percentile: 0.75

Topics

Speech and Audio Processing
Physical Sciences → Computer Science → Signal Processing
Music and Audio Processing
Physical Sciences → Computer Science → Signal Processing
Video Surveillance and Tracking Methods
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Audio-Visual Speaker Localization and Tracking

Zhao, Jinzheng

Journal: Surrey Open Research repository (University of Surrey) Year: 2025
JOURNAL ARTICLE

Deep Audio-Visual Beamforming for Speaker Localization

Xinyuan Qian, Qiquan Zhang, Guohui Guan, Wei Xue

Journal: IEEE Signal Processing Letters Year: 2022 Vol: 29 Pages: 1132-1136