Streaming End-to-End Multi-Talker Speech Recognition

Liang Lu; Naoyuki Kanda; Jinyu Li; Yifan Gong

doi:10.1109/lsp.2021.3070817

JOURNAL ARTICLE

Year: 2021; Journal: IEEE Signal Processing Letters; Vol: 28; Pages: 803-807; Publisher: Institute of Electrical and Electronics Engineers

Abstract

End-to-end multi-talker speech recognition is an emerging research trend in the speech community due to its vast potential in applications such as conversation and meeting transcription. To the best of our knowledge, all existing research is restricted to the offline scenario. In this work, we propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition. Our model employs the Recurrent Neural Network Transducer (RNN-T) as the backbone, which can meet various latency constraints. We study two model architectures, based on a speaker-differentiator encoder and a mask encoder respectively. To train this model, we investigate the widely used Permutation Invariant Training (PIT) approach and the Heuristic Error Assignment Training (HEAT) approach. Based on experiments on the publicly available LibriSpeechMix dataset, we show that HEAT achieves better accuracy than PIT, and that the SURT model with a 150-millisecond algorithmic latency constraint compares favorably with the offline sequence-to-sequence baseline model in terms of accuracy.
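
The difference between the two training criteria named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a precomputed matrix of per-pair losses (in the paper these would be RNN-T losses) and it assumes that the HEAT heuristic orders references by utterance start time, a detail the abstract itself does not state. The function names `pit_loss` and `heat_loss` and all numbers are hypothetical.

```python
from itertools import permutations


def pit_loss(loss_matrix):
    """PIT: try every channel-to-reference assignment and keep the cheapest.

    loss_matrix[i][j] is the loss of output channel i scored against
    reference transcript j (assumed to be computed elsewhere).
    """
    n = len(loss_matrix)
    return min(
        sum(loss_matrix[i][perm[i]] for i in range(n))
        for perm in permutations(range(n))
    )


def heat_loss(loss_matrix, start_times):
    """HEAT-style fixed assignment (assumed heuristic: start-time order).

    Output channel k is always scored against the reference with the
    k-th earliest start time, so no search over permutations is needed.
    """
    order = sorted(range(len(start_times)), key=lambda j: start_times[j])
    return sum(loss_matrix[k][order[k]] for k in range(len(order)))


# Hypothetical two-speaker example.
losses = [[1.0, 4.0],
          [3.0, 0.5]]
starts = [0.0, 1.2]  # reference 0 starts first, reference 1 starts later

pit_value = pit_loss(losses)
heat_value = heat_loss(losses, starts)
```

In this sketch PIT evaluates all n! assignments for n output channels, while the fixed assignment evaluates exactly one. The two values coincide only when the heuristic ordering happens to match the lowest-loss permutation.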

Keywords:

Computer science; Speech recognition; End-to-end principle; Encoder; Latency (audio); Recurrent neural network; Deep learning; Artificial neural network; Artificial intelligence

Metrics

FWCI (Field Weighted Citation Impact): 4.37

Citation Normalized Percentile: 0.95

Topics

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Related Documents

End-To-End Multi-Talker Overlapping Speech Recognition

Improving End-to-End Single-Channel Multi-Talker Speech Recognition

Unsupervised Domain Adaptation on End-to-End Multi-Talker Overlapped Speech Recognition

Endpoint Detection for Streaming End-to-End Multi-Talker ASR

Streaming End-to-end Speech Recognition for Mobile Devices