JOURNAL ARTICLE

End-To-End Multi-Talker Overlapping Speech Recognition

Abstract

In this paper we present an end-to-end speech recognition system that recognizes single-channel speech in which multiple talkers may speak at the same time (overlapping speech), using a neural network model based on the Recurrent Neural Network Transducer (RNN-T) architecture. We augment the conventional RNN-T architecture with a masking model that separates the encoded audio features and with multiple label encoders that encode transcripts from different speakers. We use a masking L2 loss to prevent transcripts from aligning to the wrong speaker's audio, and a speaker embedding loss to facilitate speaker tracking. We show that, with these additional training objectives, the proposed augmented RNN-T model can be trained on simulated overlapping speech data and can achieve a word error rate (WER) of 32% on words in overlapping speech segments from real-life telephone conversations. Our analysis of a manual transcription task on the same test set shows that transcribing overlapping speech is hard even for humans, who obtain a WER of 37% against the ground truth.
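The abstract reports its results as WER. As a rough illustration only, the sketch below computes WER under the standard definition (word-level substitutions, deletions and insertions divided by the number of reference words). The function name `word_error_rate` and the whitespace tokenization are assumptions made here; the abstract does not describe the scoring procedure or text normalization actually used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are split on whitespace with no further normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, this quantity can exceed 1.0 when the hypothesis is much longer than the reference.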

Keywords:
Computer science, Speech recognition, Recurrent neural network, Encoder, Transcription (linguistics), Masking (illustration), ENCODE, Voice activity detection, Embedding, Speech processing, Artificial intelligence, Artificial neural network

Metrics

Cited By: 38
FWCI (Field Weighted Citation Impact): 4.58
Refs: 19
Citation Normalized Percentile: 0.96

Topics

Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Speech Recognition and Synthesis
Physical Sciences →  Computer Science →  Artificial Intelligence
Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing

Related Documents

JOURNAL ARTICLE

Streaming End-to-End Multi-Talker Speech Recognition

Liang Lu, Naoyuki Kanda, Jinyu Li, Yifan Gong

Journal: IEEE Signal Processing Letters Year: 2021 Vol: 28 Pages: 803-807
JOURNAL ARTICLE

Improving End-to-End Single-Channel Multi-Talker Speech Recognition

Wangyou Zhang, Xuankai Chang, Yanmin Qian, Shinji Watanabe

Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing Year: 2020 Vol: 28 Pages: 1385-1394
JOURNAL ARTICLE

Unsupervised Domain Adaptation on End-to-End Multi-Talker Overlapped Speech Recognition

Zheng Lin, Zhu Han, Sanli Tian, Qingwei Zhao, Ta Li

Journal: IEEE Signal Processing Letters Year: 2024 Vol: 31 Pages: 3119-3123
JOURNAL ARTICLE

End-to-End Brain-Driven Speech Enhancement in Multi-Talker Conditions

Maryam Hosseini, Luca Celotti, Éric Plourde

Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing Year: 2022 Vol: 30 Pages: 1718-1733