End-To-End Multi-Speaker Speech Recognition With Transformer

Xuankai Chang; Wangyou Zhang; Yanmin Qian; Jonathan Le Roux; Shinji Watanabe

doi:10.1109/icassp40776.2020.9054029

JOURNAL ARTICLE

Year: 2020 Pages: 6134-6138

Abstract

Recently, fully recurrent neural network (RNN) based end-to-end models have proven effective for multi-speaker speech recognition in both single-channel and multi-channel scenarios. In this work, we explore the use of Transformer models for these tasks, focusing on two aspects. First, we replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture. Second, to use the Transformer in the masking network of the neural beamformer in the multi-channel case, we restrict the self-attention component to a segment rather than the whole sequence, which reduces computation. Besides these architectural improvements, we also incorporate an external dereverberation preprocessing step, the weighted prediction error (WPE) method, enabling our model to handle reverberated signals. Experiments on the spatialized wsj1-2mix corpus show that, under the anechoic condition, the Transformer-based models achieve 40.9% and 25.6% relative word error rate (WER) reduction, down to 12.1% and 6.4% WER, in the single-channel and multi-channel tasks, respectively. In the reverberant case, our methods achieve 41.5% and 13.8% relative WER reduction, down to 16.5% and 15.2% WER.
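
The segment-restricted self-attention mentioned in the abstract can be illustrated with a small sketch. The abstract does not specify how the segment is defined, so the version below assumes a symmetric window of `left` and `right` frames around each query frame; the function names, the single attention head, and the absence of learned query/key/value projections are likewise simplifications made here for illustration, not the authors' implementation.

```python
import math


def segment_bounds(i, num_frames, left, right):
    """Return the [start, stop) range of key frames visible to query frame i."""
    return max(0, i - left), min(num_frames, i + right + 1)


def restricted_self_attention(frames, left, right):
    """Single-head dot-product self-attention limited to a local segment.

    frames: list of T feature vectors (lists of floats). Each vector is used
    directly as query, key, and value (an assumption made for brevity).
    Returns (outputs, weights), where weights[i] is (start, probs) and
    probs[n] is the attention weight of frame i on frame start + n.
    """
    num_frames, dim = len(frames), len(frames[0])
    outputs, weights = [], []
    for i, query in enumerate(frames):
        start, stop = segment_bounds(i, num_frames, left, right)
        # Score only the keys inside the segment, not the whole sequence.
        scores = [
            sum(q * k for q, k in zip(query, frames[j])) / math.sqrt(dim)
            for j in range(start, stop)
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        weights.append((start, probs))
        # Weighted sum of the value vectors inside the segment.
        outputs.append([
            sum(p * frames[start + n][d] for n, p in enumerate(probs))
            for d in range(dim)
        ])
    return outputs, weights
```

In this sketch each frame scores at most `left + right + 1` keys, so the work grows with the sequence length times the segment length rather than with the square of the sequence length. Setting `left` and `right` to at least the sequence length recovers ordinary full-sequence self-attention.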

Keywords:

Computer science; Transformer; Speech recognition; Recurrent neural network; Encoder; End-to-end principle; Computation; Preprocessor; Artificial neural network; Artificial intelligence; Pattern recognition (psychology); Algorithm; Engineering; Voltage

Metrics

FWCI (Field Weighted Citation Impact): 10.64

Citation Normalized Percentile: 0.99

Topics

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Related Documents

End-to-End Multi-Speaker Speech Recognition

End-to-End Multilingual Multi-Speaker Speech Recognition

MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition

End-to-End Multi-Channel Transformer for Speech Recognition

Real-Time End-to-End Monaural Multi-Speaker Speech Recognition