Abstract

Automatic speech recognition (ASR) of overlapped speech remains a highly challenging task. To address it, state-of-the-art ASR systems widely use multi-channel microphone array data. Motivated by the invariance of the visual modality to acoustic signal corruption, this paper presents an audio-visual multi-channel overlapped speech recognition system featuring a tightly integrated separation front-end and recognition back-end. A series of audio-visual multi-channel speech separation front-end components based on time-frequency (TF) masking, filter&sum and mask-based minimum variance distortionless response (MVDR) beamforming approaches were developed. To reduce the error cost mismatch between the separation and recognition components, they were jointly fine-tuned using the connectionist temporal classification (CTC) loss function, or a multi-task criterion that interpolates it with the scale-invariant signal-to-noise ratio (Si-SNR) error cost. Experiments suggest that the proposed multi-channel audio-visual speech recognition (AVSR) system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reduction on overlapped speech constructed by simulation and by replaying the Lip Reading Sentences 2 (LRS2) dataset, respectively.
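
To make the multi-task criterion mentioned in the abstract more concrete, the following is a minimal sketch of the standard Si-SNR measure and one possible way of interpolating it with a CTC loss value. The abstract does not give the interpolation form or weight, so the function `multitask_loss`, its parameter `alpha`, and the convex-combination form are assumptions made purely for illustration, not the authors' implementation. The CTC loss is taken as an already computed number.

```python
import math
from typing import Sequence


def si_snr(estimate: Sequence[float], target: Sequence[float], eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB between an estimated and a reference waveform."""
    if len(estimate) != len(target) or len(target) == 0:
        raise ValueError("estimate and target must be non-empty and equally long")
    n = len(target)
    # Remove the mean so that a constant offset does not affect the score.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]
    # Project the estimate onto the target to obtain the scaled reference.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    s_target = [scale * t for t in tgt]
    # Everything in the estimate that is not explained by the target is noise.
    e_noise = [e - s for e, s in zip(est, s_target)]
    signal_energy = sum(s * s for s in s_target)
    noise_energy = sum(e * e for e in e_noise)
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))


def multitask_loss(ctc_loss: float, si_snr_db: float, alpha: float = 0.5) -> float:
    """Hypothetical interpolation (an assumption, not taken from the paper).

    alpha weights the CTC loss; the remaining weight goes to the negative
    Si-SNR, so that a higher separation quality lowers the total loss.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * ctc_loss + (1.0 - alpha) * (-si_snr_db)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score unchanged, which is the property the name refers to. Setting `alpha` to 1.0 in this sketch recovers fine-tuning with the CTC loss alone.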

Keywords:
Computer science, Speech recognition, Audio visual, Channel (broadcasting), Audio mining, Speech coding, Voice activity detection, Speech processing, Computer network, Multimedia

Topics

Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Advanced Adaptive Filtering Techniques
Physical Sciences →  Engineering →  Computational Mechanics
Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing

Related Documents

JOURNAL ARTICLE

Audio-Visual Multi-Channel Integration and Recognition of Overlapped Speech

Jianwei Yu, Shi-Xiong Zhang, Bo Wu, Shansong Liu, Shoukang Hu, Mengzhe Geng, Xunying Liu, Helen Meng, Dong Yu

Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing Year: 2021 Vol: 29 Pages: 2067-2082
JOURNAL ARTICLE

Audio-Visual Multi-Channel Speech Separation, Dereverberation and Recognition

Guinan Li, Jianwei Yu, Jiajun Deng, Xunying Liu, Helen Meng

Journal: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Year: 2022 Pages: 6042-6046
JOURNAL ARTICLE

Channel-Wise AV-Fusion Attention for Multi-Channel Audio-Visual Speech Recognition

Gaopeng Xu, Song Yang, Wei Li, Song Wang, Guo Wei, Junfeng Yuan, Jie Gao

Journal: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Year: 2022 Pages: 9251-9255