JOURNAL ARTICLE

A CTC Alignment-Based Non-Autoregressive Transformer for End-to-End Automatic Speech Recognition

Ruchao Fan, Wei Chu, Chang Peng, Abeer Alwan

Year: 2023 Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing Vol: 31 Pages: 1436-1448 Publisher: Institute of Electrical and Electronics Engineers

Abstract

Recently, end-to-end models have been widely used in automatic speech recognition (ASR) systems. Two of the most representative approaches are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. Autoregressive transformers, variants of AED, adopt an autoregressive mechanism for token generation and thus are relatively slow during inference. In this paper, we present a comprehensive study of a CTC Alignment-based Single-Step Non-Autoregressive Transformer (CASS-NAT) for end-to-end ASR. In CASS-NAT, word embeddings in the autoregressive transformer (AT) are substituted with token-level acoustic embeddings (TAEs) that are extracted from encoder outputs using the acoustic boundary information offered by the CTC alignment. TAEs can be obtained in parallel, resulting in parallel generation of output tokens. During training, Viterbi alignment is used for TAE generation, and multiple training strategies are further explored to improve word error rate (WER) performance. During inference, an error-based alignment sampling method is investigated in depth to reduce the alignment mismatch between the training and testing processes. Experimental results show that CASS-NAT achieves a WER close to that of AT on various ASR tasks, while providing a ~24x inference speedup. With and without self-supervised learning, we achieve new state-of-the-art results for non-autoregressive models on several datasets. We also analyze the behavior of the CASS-NAT decoder to explain why it can perform similarly to AT. We find that TAEs have similar functionality to word embeddings for grammatical structures, which might indicate the possibility of learning some semantic information from TAEs without a language model.
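
To make the TAE mechanism described above concrete, the following is a minimal sketch of how token-level acoustic embeddings could be pooled from frame-level encoder outputs given a CTC alignment. It assumes PyTorch, a blank index of 0, and mean pooling over each token's frame span; the function name and the pooling choice are illustrative assumptions, not the paper's exact extractor.

```python
import torch

BLANK_ID = 0  # assumed CTC blank index (hypothetical for this sketch)


def token_acoustic_embeddings(encoder_out: torch.Tensor,
                              alignment: torch.Tensor) -> torch.Tensor:
    """Pool frame-level encoder outputs into one embedding per token.

    encoder_out: (T, D) encoder output frames.
    alignment:   (T,) frame-level CTC alignment of token ids, with
                 BLANK_ID marking blank frames.

    Returns a (U, D) tensor with one acoustic embedding per token of the
    collapsed alignment. Mean pooling over each token's frame span is an
    illustrative choice, not necessarily the paper's implementation.
    """
    embeddings = []
    t, num_frames = 0, alignment.numel()
    while t < num_frames:
        tok = int(alignment[t])
        if tok == BLANK_ID:
            t += 1
            continue
        start = t
        # Consecutive identical labels collapse to one token under CTC.
        while t < num_frames and int(alignment[t]) == tok:
            t += 1
        embeddings.append(encoder_out[start:t].mean(dim=0))
    if not embeddings:
        return encoder_out.new_zeros(0, encoder_out.size(1))
    return torch.stack(embeddings)


# Toy usage: 8 frames, 4-dim encoder outputs, alignment "_ a a _ b _ c c".
enc = torch.randn(8, 4)
ali = torch.tensor([0, 1, 1, 0, 2, 0, 3, 3])
tae = token_acoustic_embeddings(enc, ali)
print(tae.shape)  # torch.Size([3, 4]) -> one embedding each for a, b, c
```

Because the alignment fixes every token's frame span in advance, the per-token pooling steps are independent of one another and can be batched; this independence is what allows all TAEs, and hence all output tokens, to be produced in a single parallel decoding step.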

Keywords:
Computer science, Autoregressive model, Inference, Transformer, Speech recognition, Security token, Artificial intelligence, Hidden Markov model, Word error rate, Pattern recognition (psychology), Engineering

Metrics

Cited By: 23
FWCI (Field Weighted Citation Impact): 5.88
Refs: 76
Citation Normalized Percentile: 0.95 (in top 10%)

Topics

Speech Recognition and Synthesis
Physical Sciences → Computer Science → Artificial Intelligence
Music and Audio Processing
Physical Sciences → Computer Science → Signal Processing
Natural Language Processing Techniques
Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

JOURNAL ARTICLE

A Transformer-Based End-to-End Automatic Speech Recognition Algorithm

Fang Dong, Yiyang Qian, Tianlei Wang, Peng Liu, Jiuwen Cao

Journal: IEEE Signal Processing Letters Year: 2023 Vol: 30 Pages: 1592-1596
JOURNAL ARTICLE

Non-Autoregressive End-To-End Automatic Speech Recognition Incorporating Downstream Natural Language Processing

Motoi Omachi, Yuya Fujita, Shinji Watanabe, Tianzi Wang

Journal: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Year: 2022 Vol: ii Pages: 6772-6776
JOURNAL ARTICLE

An End-to-End Transformer-Based Automatic Speech Recognition for Qur’an Reciters

Mohammed Hadwan, Hamzah A. Alsayadi, Salah Al-Hagree

Journal: Computers, Materials & Continua Year: 2022 Vol: 74 (2) Pages: 3471-3487