Ruchao Fan, Wei Chu, Chang Peng, Abeer Alwan
Recently, end-to-end models have been widely used in automatic speech recognition (ASR) systems. Two of the most representative approaches are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. Autoregressive transformers, a variant of AED, generate tokens one at a time and are therefore relatively slow during inference. In this paper, we present a comprehensive study of a CTC Alignment-based Single-Step Non-Autoregressive Transformer (CASS-NAT) for end-to-end ASR. In CASS-NAT, the word embeddings of the autoregressive transformer (AT) are replaced with token-level acoustic embeddings (TAEs), which are extracted from the encoder outputs using the acoustic boundary information provided by the CTC alignment. Because TAEs can be obtained in parallel, all output tokens are generated in parallel. During training, Viterbi alignment is used to generate TAEs, and multiple training strategies are further explored to improve word error rate (WER) performance. During inference, an error-based alignment sampling method is investigated in depth to reduce the mismatch between the alignments used in training and testing. Experimental results show that CASS-NAT achieves WERs close to those of AT on various ASR tasks while providing an approximately 24x inference speedup. With and without self-supervised learning, we achieve new state-of-the-art results for non-autoregressive models on several datasets. We also analyze the behavior of the CASS-NAT decoder to explain why it can perform similarly to AT. We find that TAEs function similarly to word embeddings with respect to grammatical structures, which might indicate that some semantic information can be learned from TAEs without a language model.
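The abstract describes TAEs as encoder outputs grouped by CTC-alignment boundaries, one TAE per output token, so the decoder can emit every token in parallel. The Python sketch below illustrates that idea under loudly stated assumptions: token id 0 is the blank, a greedy best-path alignment stands in for the Viterbi (training) or sampled (inference) alignments, the segment boundary rule is hypothetical, and mean pooling replaces whatever extractor the paper actually uses. All function names are illustrative and do not come from the CASS-NAT code.

```python
from typing import List, Sequence, Tuple

# Assumption for this sketch: token id 0 is the CTC blank symbol.
BLANK = 0


def greedy_alignment(frame_probs: Sequence[Sequence[float]]) -> List[int]:
    """Pick the most probable label per frame (best path).

    Stand-in only: CASS-NAT uses Viterbi alignment in training and a sampled
    alignment at inference; any frame-level alignment can be passed on below.
    """
    return [max(range(len(p)), key=p.__getitem__) for p in frame_probs]


def token_segments(alignment: Sequence[int], blank: int = BLANK) -> List[Tuple[int, int, int]]:
    """Turn a frame-level CTC alignment into (token, start, end) frame segments.

    A token starts at a non-blank frame whose label differs from the previous
    frame (CTC merges repeats unless a blank separates them). Hypothetical
    boundary rule: each segment runs until the next token starts, and leading
    blanks are attached to the first token, so segments cover all frames.
    """
    onsets = [
        (label, t)
        for t, label in enumerate(alignment)
        if label != blank and (t == 0 or label != alignment[t - 1])
    ]
    segments = []
    for i, (label, t) in enumerate(onsets):
        start = 0 if i == 0 else t
        end = onsets[i + 1][1] if i + 1 < len(onsets) else len(alignment)
        segments.append((label, start, end))
    return segments


def token_acoustic_embeddings(
    encoder_out: Sequence[Sequence[float]], segments: Sequence[Tuple[int, int, int]]
) -> List[List[float]]:
    """Mean-pool the encoder frames of each segment into one vector per token.

    Mean pooling is a simplification chosen for this sketch; the extractor
    in the paper may combine frames differently (e.g. with attention).
    """
    embeddings = []
    for _, start, end in segments:
        frames = encoder_out[start:end]
        dim = len(frames[0])
        embeddings.append([sum(f[d] for f in frames) / len(frames) for d in range(dim)])
    return embeddings


if __name__ == "__main__":
    alignment = [0, 3, 3, 0, 5, 0, 5, 5, 0]  # toy alignment: tokens 3, 5, 5
    encoder_out = [[float(t), 1.0] for t in range(len(alignment))]  # toy 2-d frames
    segs = token_segments(alignment)
    print(segs)  # [(3, 0, 4), (5, 4, 6), (5, 6, 9)]
    print(token_acoustic_embeddings(encoder_out, segs))  # [[1.5, 1.0], [4.5, 1.0], [7.0, 1.0]]
```

In this sketch the number of segments fixes the number of TAEs, so the alignment also fixes the output length. That is one way to see why an alignment mismatch between training and inference matters, and why the abstract devotes attention to sampling alignments at inference time.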
Zhengkun Tian, Jiangyan Yi, Jianhua Tao, Ye Bai, Shuai Zhang, Zhengqi Wen
Fang Dong, Yiyang Qian, Tianlei Wang, Peng Liu, Jiuwen Cao
Motoi Omachi, Yuya Fujita, Shinji Watanabe, Tianzi Wang
Zhifu Gao, Shiliang Zhang, Ian McLoughlin, Zhijie Yan
Mohammed Hadwan, Hamzah A. Alsayadi, Salah Al-Hagree