JOURNAL ARTICLE

3D audio-visual speaker tracking with an adaptive particle filter

Abstract

We propose an audio-visual fusion algorithm for 3D speaker tracking from a co-located multi-modal sensing platform composed of a camera and a small microphone array. After extracting audio and visual cues from the individual modalities, we fuse them adaptively in a particle filter framework according to their reliability. The reliability of the audio signal is measured by the maximum peak value of the Global Coherence Field (GCF) at each frame. The visual reliability is based on colour-histogram matching between detection results and a reference image in the RGB space. Experiments on the AV16.3 dataset show that the proposed adaptive audio-visual tracker outperforms both the individual modalities and a classical approach with fixed fusion parameters in terms of tracking accuracy.
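The adaptive fusion step described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's exact formulation: the function name `fuse_weights` and the reliability-weighted exponent form are assumptions, and it presumes the GCF peak value has already been normalised into an audio-reliability score `alpha` in [0, 1].

```python
def fuse_weights(audio_lik, video_lik, alpha):
    """Combine per-particle audio and visual likelihoods adaptively.

    audio_lik, video_lik: per-particle likelihoods from each modality.
    alpha: audio reliability in [0, 1], e.g. a normalised GCF peak value
           (hypothetical normalisation; the paper's exact mapping may differ).
    Returns normalised particle weights.
    """
    # Reliability-weighted product of the two likelihoods per particle.
    raw = [a ** alpha * v ** (1.0 - alpha)
           for a, v in zip(audio_lik, video_lik)]
    total = sum(raw)
    # Normalise so the particle weights sum to one.
    return [w / total for w in raw]

# Toy example with three particles: when alpha = 1 the weights
# follow the audio likelihood alone; when alpha = 0, the visual one.
audio = [0.7, 0.2, 0.1]
video = [0.1, 0.3, 0.6]
print(fuse_weights(audio, video, 0.5))
```

With an intermediate `alpha`, particles favoured by both modalities dominate, which is the intended effect of weighting each cue by its measured reliability.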

Keywords:
Audio-visual fusion; speaker tracking; particle filter; microphone array; Global Coherence Field; colour histogram; computer vision; speech processing

Metrics

Cited by: 29
References: 31
FWCI (Field-Weighted Citation Impact): 3.24
Citation Normalized Percentile: 0.92 (in top 10%)

Topics

Speech and Audio Processing (Physical Sciences → Computer Science → Signal Processing)
Music and Audio Processing (Physical Sciences → Computer Science → Signal Processing)
Advanced Adaptive Filtering Techniques (Physical Sciences → Engineering → Computational Mechanics)