Audio-Visual Multi-Channel Recognition of Overlapped Speech

Jianwei Yu; Bo Wu; Rongzhi Gu; Shi-Xiong Zhang; Lianwu Chen; Yong Xu; Yu Meng; Dan Su; Dong Yu; Xunying Liu; Helen Meng

doi:10.21437/interspeech.2020-2346

JOURNAL ARTICLE

Year: 2020 Pages: 3496-3500

Abstract

Automatic speech recognition (ASR) of overlapped speech remains a highly challenging task. Multi-channel microphone array data are therefore widely used in state-of-the-art ASR systems. Motivated by the invariance of the visual modality to acoustic signal corruption, this paper presents an audio-visual multi-channel overlapped speech recognition system featuring a tightly integrated separation front-end and recognition back-end. A series of audio-visual multi-channel speech separation front-end components based on TF masking, filter&sum and mask-based MVDR beamforming approaches were developed. To reduce the error cost mismatch between the separation and recognition components, they were jointly fine-tuned using the connectionist temporal classification (CTC) loss function, or a multi-task criterion that interpolates it with the scale-invariant signal-to-noise ratio (Si-SNR) error cost. Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reduction on overlapped speech constructed using either simulation or replaying of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
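
The Si-SNR error cost named in the abstract can be illustrated with a minimal sketch. The Si-SNR computation below follows the commonly used definition (zero-mean signals, projection of the estimate onto the reference). The function `multitask_loss`, its `weight` parameter, and the linear form of the interpolation are assumptions made for illustration only; the abstract does not state how the paper combines the two costs, and the CTC loss is taken here as a precomputed number.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length waveforms."""
    n = len(target)
    if n == 0 or len(estimate) != n:
        raise ValueError("waveforms must be non-empty and of equal length")
    # Remove the mean so that DC offsets do not affect the measure.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target to obtain the scaled reference.
    scale = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    s_target = [scale * b for b in t]
    # Everything not explained by the scaled reference counts as noise.
    noise = [a - s for a, s in zip(e, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(x * x for x in noise)
    return 10.0 * math.log10(num / (den + eps) + eps)


def multitask_loss(ctc_loss, estimate, target, weight=0.5):
    """Hypothetical interpolation of a CTC loss value with negative Si-SNR.

    `weight` is an assumed interpolation factor in [0, 1], not a value
    taken from the paper.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    # Si-SNR is maximised, so its negative is used as a cost.
    return (1.0 - weight) * ctc_loss + weight * (-si_snr(estimate, target))
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the separated waveform leaves the score essentially unchanged, which is what makes the measure scale-invariant.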

Keywords: Computer science; Speech recognition; Audio visual; Channel (broadcasting); Audio mining; Speech coding; Voice activity detection; Speech processing; Computer network; Multimedia

Metrics

FWCI (Field Weighted Citation Impact): 2.66

Citation Normalized Percentile: 0.91

Topics

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Advanced Adaptive Filtering Techniques

Physical Sciences → Engineering → Computational Mechanics

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Related Documents

Audio-Visual Multi-Channel Integration and Recognition of Overlapped Speech

Audio-Visual Multi-Channel Speech Separation, Dereverberation and Recognition

Channel-Wise AV-Fusion Attention for Multi-Channel Audio-Visual Speech Recognition

Audio-Visual Recognition of Overlapped Speech for the LRS2 Dataset

Multi-Channel Overlapped Speech Recognition with Location Guided Speech Extraction Network