JOURNAL ARTICLE

Audio-Visual End-to-End Multi-Channel Speech Separation, Dereverberation and Recognition

Guinan Li, Jiajun Deng, Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Mingyu Cui, Helen Meng, Xunying Liu

Year: 2023 Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing Vol: 31 Pages: 2707-2723 Publisher: Institute of Electrical and Electronics Engineers

Abstract

Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberation remains a highly challenging task to date. Motivated by the invariance of the visual modality to acoustic signal corruption, this paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach that incorporates visual information into all system components. The efficacy of the video input is consistently demonstrated in the mask-based MVDR speech separation front-end, the DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end, and the Conformer ASR back-end. Audio-visual integrated front-end architectures that perform speech separation and dereverberation either in a pipelined fashion or jointly via mask-based WPD are investigated. The error cost mismatch between the speech enhancement front-end and the ASR back-end is minimized by joint end-to-end fine-tuning using either the ASR cost function alone or its interpolation with the speech enhancement loss. Experiments were conducted on overlapped and reverberant speech mixtures constructed by simulation or replay of the Oxford LRS2 dataset. The proposed audio-visual multi-channel speech separation, dereverberation and recognition systems consistently outperformed the comparable audio-only baseline, with word error rate (WER) reductions of 9.1% and 6.2% absolute (41.7% and 36.0% relative). Consistent speech enhancement improvements were also obtained on PESQ, STOI and SRMR scores.
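
The abstract reports each WER improvement both as an absolute and as a relative reduction. The short Python sketch below only illustrates how the two quantities relate. The function names are hypothetical, and the baseline WERs it prints are back-computed from the rounded figures in the abstract, so they are approximations and not values reported by the authors.

```python
def relative_reduction(baseline_wer, system_wer):
    """Relative WER reduction in percent, given two WERs in percent."""
    return (baseline_wer - system_wer) / baseline_wer * 100.0


def implied_baseline_wer(absolute_reduction, relative_reduction_pct):
    """Baseline WER (percent) implied by an absolute reduction (percentage
    points) and the matching relative reduction (percent).

    Follows from: relative = absolute / baseline * 100.
    """
    return absolute_reduction / relative_reduction_pct * 100.0


# The two (absolute, relative) pairs quoted in the abstract.
reported = [(9.1, 41.7), (6.2, 36.0)]

for absolute, relative in reported:
    # Approximate values only: the inputs are rounded to one decimal place.
    baseline = implied_baseline_wer(absolute, relative)
    system = baseline - absolute
    print(f"absolute {absolute}%, relative {relative}% -> "
          f"implied baseline about {baseline:.1f}%, system about {system:.1f}%")
```

An absolute reduction is a difference in percentage points, while a relative reduction divides that difference by the baseline WER, which is why the same system can show a modest absolute gain and a large relative one.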

Keywords:
Speech recognition, Computer science, Reverberation, Front and back ends, PESQ, End-to-end principle, Speech enhancement, Artificial intelligence, Noise reduction, Acoustics

Metrics

Cited By: 15
FWCI (Field Weighted Citation Impact): 4.03
Refs: 103
Citation Normalized Percentile: 0.93

Topics

Speech and Audio Processing
Physical Sciences → Computer Science → Signal Processing
Advanced Adaptive Filtering Techniques
Physical Sciences → Engineering → Computational Mechanics
Blind Source Separation Techniques
Physical Sciences → Computer Science → Signal Processing

Related Documents

JOURNAL ARTICLE

Audio-Visual Multi-Channel Speech Separation, Dereverberation and Recognition

Guinan Li, Jianwei Yu, Jiajun Deng, Xunying Liu, Helen Meng

Journal: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Year: 2022 Pages: 6042-6046

JOURNAL ARTICLE

MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition

Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe

Journal: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Year: 2019 Pages: 237-244