JOURNAL ARTICLE

Deep Audio-Visual Beamforming for Speaker Localization

Xinyuan Qian, Qiquan Zhang, Guohui Guan, Wei Xue

Year: 2022 Journal: IEEE Signal Processing Letters Vol: 29 Pages: 1132-1136 Publisher: Institute of Electrical and Electronics Engineers

Abstract

Generalized Cross Correlation (GCC) has been the most popular localization technique over the past decades, and it can be extended with a beamforming method, e.g., Steered Response Power (SRP), when multiple microphone pairs exist. Considering the promising results of Deep Learning (DL) strategies over classical approaches, in this work SRP is derived from DL-learnt ideal correlation functions for each microphone pair of an array, instead of from GCC directly. To exploit visual information, we explore the Conditional Variational Auto-Encoder (CVAE) framework, in which the audio generative process is conditioned on visual features encoded from face detections. The vision-derived auxiliary correlation function then contributes to the back-end beamformer for improved localization performance. To the best of our knowledge, this is the first deep-generative audio-visual method for speaker localization. Experimental results demonstrate the superior performance of the proposed method over other competitive methods, especially when the speech signal is corrupted by noise.
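For readers unfamiliar with the classical baseline the abstract refers to, the following is a minimal sketch of GCC with PHAT weighting and a far-field SRP map built from it. It is not the method of the paper: it uses plain GCC-PHAT in place of the DL-learnt correlation functions and has no visual branch. The function names `gcc_phat` and `srp_map`, the 2-D far-field geometry, and the default speed of sound are assumptions made for illustration only.

```python
import itertools
import numpy as np

def gcc_phat(x1, x2, n_fft=None):
    """Classical GCC-PHAT between two channels.

    Returns (lags, cc); the peak lag approximates t1 - t2 in samples,
    where t_i is the arrival time at microphone i.
    """
    n_fft = n_fft or (len(x1) + len(x2))   # zero-pad to limit wrap-around
    X1 = np.fft.rfft(x1, n=n_fft)
    X2 = np.fft.rfft(x2, n=n_fft)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12          # PHAT: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n_fft)
    lags = np.arange(n_fft) - n_fft // 2    # lag axis after centering
    return lags, np.fft.fftshift(cc)

def srp_map(signals, mic_xy, fs, angles_deg, c=343.0, n_fft=None):
    """Far-field SRP: sum each pair's GCC at the lag a direction implies.

    signals: list of 1-D arrays, one per microphone (assumed layout).
    mic_xy:  (M, 2) microphone positions in metres (assumed 2-D geometry).
    """
    mic_xy = np.asarray(mic_xy, dtype=float)
    angles = np.deg2rad(np.asarray(angles_deg, dtype=float))
    u = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # unit vectors toward source
    power = np.zeros(len(angles))
    for i, j in itertools.combinations(range(len(signals)), 2):
        lags, cc = gcc_phat(signals[i], signals[j], n_fft)
        # Expected lag t_i - t_j (in samples) for a plane wave from each direction.
        expected = -(u @ (mic_xy[i] - mic_xy[j])) / c * fs
        power += np.interp(expected, lags, cc)  # linear interpolation between lags
    return power
```

The estimated direction is the candidate angle that maximizes the returned map, for example `angles_deg[np.argmax(power)]`. In the approach described in the abstract, the per-pair correlation `cc` would instead come from a learnt model, with a vision-derived auxiliary correlation contributing to the same back-end summation.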

Keywords:
Computer science, Beamforming, Microphone, Microphone array, Speech recognition, Artificial intelligence, Autoencoder, Noise (video), Pattern recognition (psychology), Correlation, Deep learning, Image (mathematics), Mathematics

Metrics

Cited By: 14
FWCI (Field Weighted Citation Impact): 2.73
Refs: 38
Citation Normalized Percentile: 0.87

Topics

Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Advanced Adaptive Filtering Techniques
Physical Sciences →  Engineering →  Computational Mechanics

Related Documents

JOURNAL ARTICLE

Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization

Hao Jiang, Calvin Murdock, Vamsi Krishna Ithapu

Journal: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Year: 2022 Pages: 10534-10542

JOURNAL ARTICLE

Audio-Visual Speaker Localization and Tracking

Zhao, Jinzheng

Journal: Surrey Open Research repository (University of Surrey) Year: 2025