JOURNAL ARTICLE

Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization

Hao Jiang, Calvin Murdock, Vamsi Krishna Ithapu

Year: 2022 Journal: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Pages: 10534-10542

Abstract

Augmented reality devices have the potential to enhance human perception and enable other assistive functionalities in complex conversational environments. Effectively capturing the audio-visual context necessary for understanding these social interactions first requires detecting and localizing the voice activities of the device wearer and the surrounding people. These tasks are challenging due to their egocentric nature: the wearer's head motion may cause motion blur, surrounding people may appear in difficult viewing angles, and there may be occlusions, visual clutter, audio noise, and bad lighting. Under these conditions, previous state-of-the-art active speaker detection methods do not give satisfactory results. Instead, we tackle the problem from a new setting using both video and multi-channel microphone array audio. We propose a novel end-to-end deep learning approach that is able to give robust voice activity detection and localization results. In contrast to previous methods, our method localizes active speakers from all possible directions on the sphere, even outside the camera's field of view, while simultaneously detecting the device wearer's own voice activity. Our experiments show that the proposed method gives superior results, can run in real time, and is robust against noise and clutter.

Keywords:
Computer science, Clutter, Computer vision, Artificial intelligence, Microphone, Noise (video), Context (archaeology), Visualization, Channel (broadcasting), Speech recognition, Perception, Augmented reality, Radar, Image (mathematics)

Metrics

Cited By: 36
FWCI (Field Weighted Citation Impact): 5.05
Refs: 36
Citation Normalized Percentile: 0.96

Topics

Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Indoor and Outdoor Localization Technologies
Physical Sciences →  Engineering →  Electrical and Electronic Engineering
Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing

Related Documents

JOURNAL ARTICLE

Deep Audio-Visual Beamforming for Speaker Localization

Xinyuan Qian, Qiquan Zhang, Guohui Guan, Wei Xue

Journal: IEEE Signal Processing Letters Year: 2022 Vol: 29 Pages: 1132-1136

JOURNAL ARTICLE

Audio-Visual Speaker Localization and Tracking

Zhao, Jinzheng

Journal: Surrey Open Research repository (University of Surrey) Year: 2025

JOURNAL ARTICLE

AS-Net: active speaker detection using deep audio-visual attention

Abduljalil Radman, Jorma Laaksonen

Journal: Multimedia Tools and Applications Year: 2024 Vol: 83 (28) Pages: 72027-72042