End-To-End Multi-Talker Overlapping Speech Recognition

Anshuman Tripathi; Lu Han; Haşim Sak

doi:10.1109/icassp40776.2020.9054328

JOURNAL ARTICLE

Year: 2020 Pages: 6129-6133

Abstract

In this paper we present an end-to-end speech recognition system that recognizes single-channel speech in which multiple talkers may speak at the same time (overlapping speech), using a neural network model based on the Recurrent Neural Network Transducer (RNN-T) architecture. We augment the conventional RNN-T architecture with a masking model that separates encoded audio features and with multiple label encoders that encode transcripts from different speakers. We use a masking L2 loss to prevent transcripts from aligning to the wrong speaker's audio, and a speaker embedding loss to facilitate speaker tracking. We show that, with these additional training objectives, the proposed augmented RNN-T model can be trained on simulated overlapping speech data and achieves a WER of 32% on words in overlapping speech segments from real-life telephone conversations. Our analysis of a manual transcription task on the same test set shows that transcribing overlapping speech is hard even for humans, who reach a WER of 37% against the ground truth.
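The WER figures quoted above (32% for the model, 37% for human transcribers) refer to word error rate. The abstract does not describe the scoring protocol used for overlapping segments, so the following is only a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, not details from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length.

    Illustrative only; tokenization by whitespace is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.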

Keywords:

Computer science; Speech recognition; Recurrent neural network; Encoder; Transcription (linguistics); Masking (illustration); ENCODE; Voice activity detection; Embedding; Speech processing; Artificial intelligence; Artificial neural network

Metrics

FWCI (Field Weighted Citation Impact): 4.58

Citation Normalized Percentile: 0.96

Topics

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Related Documents

Streaming End-to-End Multi-Talker Speech Recognition

Improving End-to-End Single-Channel Multi-Talker Speech Recognition

Unsupervised Domain Adaptation on End-to-End Multi-Talker Overlapped Speech Recognition

End-to-End Audio-Visual Speech Recognition for Overlapping Speech

End-to-End Brain-Driven Speech Enhancement in Multi-Talker Conditions