Streaming End-to-End Target-Speaker Automatic Speech Recognition and Activity Detection

Takafumi Moriya; Hiroshi Satō; Tsubasa Ochiai; Marc Delcroix; Takahiro Shinozaki

doi:10.1109/access.2023.3243690


JOURNAL ARTICLE

Year: 2023; Journal: IEEE Access; Vol: 11; Pages: 13906-13917; Publisher: Institute of Electrical and Electronics Engineers

Abstract

Automatic speech recognition of a target speaker in the presence of interfering speakers remains a challenging issue. One approach to this problem is target-speaker speech recognition, which conditions the recognition process on an embedding that characterizes the voice of the target speaker. This enables recognizing only the speech of the target speaker while ignoring interfering speech. In this work, we propose an end-to-end target-speaker speech recognition system based on a neural transducer architecture to allow streaming and on-device recognition. Moreover, a target-speaker speech recognition system should be able to detect when the target speaker is inactive and output nothing in that case. We introduce training and decoding schemes to allow target-speaker activity detection within our proposed recognition system. We confirm experimentally that our proposed end-to-end system performs competitively with conventional cascade approaches that combine a target speech extraction module with a recognition module, while reducing computational cost and allowing streaming decoding.

Keywords:

Computer science; Speech recognition; Speaker recognition; Voice activity detection; Decoding methods; Speaker diarisation; End-to-end principle; Speech processing; Artificial intelligence; Pattern recognition (psychology)

Metrics

FWCI (Field Weighted Citation Impact): 2.68

Citation Normalized Percentile: 0.88

Topics

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Related Documents

End-to-End Multi-Speaker Speech Recognition

Multi-Speaker Data Augmentation for Improved end-to-end Automatic Speech Recognition

End-to-End Automatic Speech Recognition Integrated with CTC-Based Voice Activity Detection

End-to-End Multilingual Multi-Speaker Speech Recognition

Survey of end-to-end multi-speaker automatic speech recognition for monaural audio