Advanced Long-Context End-to-End Speech Recognition Using Context-Expanded Transformers

Takaaki Hori; Niko Moritz; Chiori Hori; Jonathan Le Roux

doi:10.21437/interspeech.2021-1643

JOURNAL ARTICLE

Year: 2021 Pages: 2097-2101

Abstract

This paper addresses end-to-end automatic speech recognition (ASR) for long audio recordings such as lecture and conversational speeches. Most end-to-end ASR models are designed to recognize independent utterances, but contextual information (e.g., speaker or topic) over multiple utterances is known to be useful for ASR. In our prior work, we proposed a context-expanded Transformer that accepts multiple consecutive utterances at the same time and predicts an output sequence for the last utterance, achieving 5-15% relative error reduction from utterance-based baselines in lecture and conversational ASR benchmarks. Although the results have shown remarkable performance gain, there is still potential to further improve the model architecture and the decoding process. In this paper, we extend our prior work by (1) introducing the Conformer architecture to further improve the accuracy, (2) accelerating the decoding process with a novel activation recycling technique, and (3) enabling streaming decoding with triggered attention. We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance, obtaining a 17.3% character error rate for the HKUST dataset and 12.0%/6.3% word error rates for the Switchboard-300 Eval2000 CallHome/Switchboard test sets. The new decoding method reduces decoding time by more than 50% and further enables streaming ASR with limited accuracy degradation.
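
As an illustration of the context-expanded input described in the abstract, the following minimal Python sketch pairs each utterance with a fixed number of preceding utterances, so that a model could see the context and predict only the last utterance. The function names, the `num_context` parameter, and the toy utterance labels are assumptions made for illustration; the abstract does not state how many utterances are concatenated or how the model consumes them. The second helper shows how a relative error reduction figure is computed, using hypothetical error rates rather than results from the paper.

```python
from typing import List, Sequence, Tuple


def context_windows(
    utterances: Sequence[str], num_context: int
) -> List[Tuple[List[str], str]]:
    """Pair each utterance with up to `num_context` preceding utterances.

    Returns a list of (context, target) tuples. The target is the utterance
    to be recognized; the context holds the utterances immediately before it.
    Early utterances simply get a shorter (possibly empty) context.
    """
    windows = []
    for i, target in enumerate(utterances):
        start = max(0, i - num_context)
        windows.append((list(utterances[start:i]), target))
    return windows


def relative_error_reduction(baseline: float, improved: float) -> float:
    """Relative reduction, in percent, from a baseline error rate."""
    return 100.0 * (baseline - improved) / baseline


# Toy example with placeholder utterance labels (hypothetical data).
windows = context_windows(["u1", "u2", "u3", "u4"], num_context=2)
for context, target in windows:
    print(context, "->", target)

# Hypothetical error rates, not taken from the paper.
print(relative_error_reduction(baseline=20.0, improved=17.0))
```

In this sketch the window slides by one utterance at a time, so consecutive windows share most of their context. That overlap is the kind of redundancy a technique such as the activation recycling mentioned in the abstract could exploit, although the sketch itself does not implement it.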

Keywords:

Computer science; Decoding methods; Utterance; Speech recognition; Transformer; End-to-end principle; Word error rate; Context (archaeology); Artificial intelligence; Algorithm; Voltage

Metrics

FWCI (Field Weighted Citation Impact): 3.39

Citation Normalized Percentile: 0.93

Topics

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Related Documents

Transformer-Based Long-Context End-to-End Speech Recognition

Dialog-Context Aware end-to-end Speech Recognition

Deep Context: End-to-end Contextual Speech Recognition

Towards Context-Aware End-to-End Code-Switching Speech Recognition

Do End-to-End Speech Recognition Models Care About Context?