Improving Attention-Based End-to-End Speech Recognition by Monotonic Alignment Attention Matrix Reconstruction

Ziyang Zhuang; Kun Zou; Chenfeng Miao; Ming Fang; Tao Wei; Zijian Li; Wei Hu; Shaojun Wang; Jing Xiao

doi:10.1109/icassp48485.2024.10447049


JOURNAL ARTICLE


Year: 2024 Pages: 10546-10550



Abstract

In automatic speech recognition (ASR) tasks, the output sequence should correspond to a linear transcription of the input sequence. Much work has been done on learning monotonic alignments in end-to-end (E2E) ASR models, but these methods mainly target streaming purposes and usually degrade ASR performance. In contrast, some studies have shown that monotonic alignment benefits the performance of non-streaming attention-based models. Motivated by this, we propose the enhanced Gaussian Monotonic Alignment (e-GMA), which reduces the difficulty of learning monotonic alignment and whose reconstructed attention matrix improves accuracy in ASR tasks. Experiments on the LibriSpeech dataset demonstrate the effectiveness of the proposed approach. Compared with a strong baseline obtained from WeNet, the proposed model yields a 12.2% relative WER reduction on the test-clean benchmark and 9.9% on test-other.
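The relative WER reduction quoted in the abstract is conventionally computed as the drop in WER divided by the baseline WER. The short sketch below shows that arithmetic only. The function name `relative_wer_reduction` and the numeric inputs are hypothetical and chosen for illustration, since this record does not give the absolute WER values for the baseline or the proposed model.

```python
def relative_wer_reduction(baseline_wer: float, proposed_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - proposed_wer) / baseline_wer


# Hypothetical WER values in percent, NOT taken from the paper;
# they are picked only to illustrate how the ratio is formed.
example = relative_wer_reduction(baseline_wer=10.0, proposed_wer=8.78)
print(f"{example:.1%}")
```

Because the metric is relative, the same percentage can correspond to very different absolute WER gaps depending on how low the baseline already is.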

Keywords:

Monotonic function; Benchmark (surveying); Computer science; Gaussian; Speech recognition; Task (project management); Sequence (biology); Focus (optics); End-to-end principle; Artificial intelligence; Algorithm; Pattern recognition (psychology); Mathematics; Engineering

Metrics

FWCI (Field Weighted Citation Impact): 0.64

Citation Normalized Percentile: 0.63

Topics

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence


Related Documents

Towards Efficiently Learning Monotonic Alignments for Attention-based End-to-End Speech Recognition

Explicit Alignment of Text and Speech Encodings for Attention-Based End-to-End Speech Recognition

Character-Aware Attention-Based End-to-End Speech Recognition

Improving Attention Based Sequence-to-Sequence Models for End-to-End English Conversational Speech Recognition

Toward Developing Attention-Based End-To-End Automatic Speech Recognition