JOURNAL ARTICLE

A CTC Alignment-Based Non-Autoregressive Transformer for End-to-End Automatic Speech Recognition

Ruchao Fan, Wei Chu, Chang Peng, Abeer Alwan

Year: 2023 Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing Vol: 31 Pages: 1436-1448 Publisher: Institute of Electrical and Electronics Engineers

Abstract

Recently, end-to-end models have been widely used in automatic speech recognition (ASR) systems. Two of the most representative approaches are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. Autoregressive transformers, variants of AED, adopt an autoregressive mechanism for token generation and thus are relatively slow during inference. In this paper, we present a comprehensive study of a CTC Alignment-based Single-Step Non-Autoregressive Transformer (CASS-NAT) for end-to-end ASR. In CASS-NAT, word embeddings in the autoregressive transformer (AT) are substituted with token-level acoustic embeddings (TAEs) that are extracted from encoder outputs using the acoustic boundary information offered by the CTC alignment. TAEs can be obtained in parallel, resulting in parallel generation of output tokens. During training, Viterbi alignment is used for TAE generation, and multiple training strategies are further explored to improve word error rate (WER) performance. During inference, an error-based alignment sampling method is investigated in depth to reduce the alignment mismatch between the training and testing processes. Experimental results show that CASS-NAT achieves a WER close to that of AT on various ASR tasks, while providing a ~24x inference speedup. With and without self-supervised learning, we achieve new state-of-the-art results for non-autoregressive models on several datasets. We also analyze the behavior of the CASS-NAT decoder to explain why it can perform similarly to AT. We find that TAEs have similar functionality to word embeddings for grammatical structures, which might indicate the possibility of learning some semantic information from TAEs without a language model.
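
To make the TAE mechanism described above concrete, the following is a minimal sketch of how token-level acoustic embeddings could be pooled from frame-level encoder outputs given a CTC alignment. It assumes PyTorch, a blank index of 0, and mean pooling over each token's frame span; the function name and the pooling choice are illustrative assumptions, not the paper's exact extractor.

```python
import torch

BLANK_ID = 0  # assumed CTC blank index (hypothetical for this sketch)


def token_acoustic_embeddings(encoder_out: torch.Tensor,
                              alignment: torch.Tensor) -> torch.Tensor:
    """Pool frame-level encoder outputs into one embedding per token.

    encoder_out: (T, D) encoder output frames.
    alignment:   (T,) frame-level CTC alignment of token ids, with
                 BLANK_ID marking blank frames.

    Returns a (U, D) tensor with one acoustic embedding per token of the
    collapsed alignment. Mean pooling over each token's frame span is an
    illustrative choice, not necessarily the paper's implementation.
    """
    embeddings = []
    t, num_frames = 0, alignment.numel()
    while t < num_frames:
        tok = int(alignment[t])
        if tok == BLANK_ID:
            t += 1
            continue
        start = t
        # Consecutive identical labels collapse to one token under CTC.
        while t < num_frames and int(alignment[t]) == tok:
            t += 1
        embeddings.append(encoder_out[start:t].mean(dim=0))
    if not embeddings:
        return encoder_out.new_zeros(0, encoder_out.size(1))
    return torch.stack(embeddings)


# Toy usage: 8 frames, 4-dim encoder outputs, alignment "_ a a _ b _ c c".
enc = torch.randn(8, 4)
ali = torch.tensor([0, 1, 1, 0, 2, 0, 3, 3])
tae = token_acoustic_embeddings(enc, ali)
print(tae.shape)  # torch.Size([3, 4]) -> one embedding each for a, b, c
```

Because the alignment fixes every token's frame span in advance, the per-token pooling steps are independent of one another and can be batched; this independence is what allows all TAEs, and hence all output tokens, to be produced in a single parallel decoding step.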

Keywords:
Computer science, Autoregressive model, Inference, Transformer, Speech recognition, Security token, Artificial intelligence, Hidden Markov model, Word error rate, Pattern recognition (psychology), Engineering

Metrics

Cited By: 23
FWCI (Field Weighted Citation Impact): 5.88
Refs: 76
Citation Normalized Percentile: 0.95 (in top 10%)

Topics

Speech Recognition and Synthesis
Physical Sciences → Computer Science → Artificial Intelligence
Music and Audio Processing
Physical Sciences → Computer Science → Signal Processing
Natural Language Processing Techniques
Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

JOURNAL ARTICLE

A Transformer-Based End-to-End Automatic Speech Recognition Algorithm

Fang Dong, Yiyang Qian, Tianlei Wang, Peng Liu, Jiuwen Cao

Journal: IEEE Signal Processing Letters Year: 2023 Vol: 30 Pages: 1592-1596
JOURNAL ARTICLE

Non-Autoregressive End-To-End Automatic Speech Recognition Incorporating Downstream Natural Language Processing

Motoi Omachi, Yuya Fujita, Shinji Watanabe, Tianzi Wang

Journal: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Year: 2022 Vol: ii Pages: 6772-6776
JOURNAL ARTICLE

An End-to-End Transformer-Based Automatic Speech Recognition for Qur’an Reciters

Mohammed Hadwan, Hamzah A. Alsayadi, Salah Al-Hagree

Journal: Computers, Materials & Continua Year: 2022 Vol: 74 (2) Pages: 3471-3487