DISSERTATION

Efficient Transformer-Based Speech Recognition

Vyas, Apoorv

Year: 2022 University: École Polytechnique Fédérale de Lausanne (via the Infoscience repository)

Abstract

Training deep neural network-based Automatic Speech Recognition (ASR) models often requires thousands of hours of transcribed data, limiting their use to only a few languages. Moreover, current state-of-the-art acoustic models are based on the Transformer architecture, which scales quadratically with sequence length, hindering its use for long sequences. This thesis aims to reduce (a) the data and (b) the compute requirements for developing state-of-the-art ASR systems with only a few hundred hours of transcribed data or less. The first part of this thesis focuses on reducing the amount of transcribed data required to train these models. We propose an approach that uses dropout for uncertainty-aware semi-supervised learning. We show that our approach generates better hypotheses for training with unlabelled data. We then investigate the out-of-domain and cross-lingual generalization of two popular self-supervised pre-training approaches: Masked Acoustic Modeling and wav2vec 2.0. We conclude that both pre-training approaches generalize to unseen domains and significantly outperform models trained only with supervised data. In the second part, we focus on reducing the computational requirements of the Transformer model, (a) by devising efficient forms of attention computation and (b) by reducing the input context length for attention computation. We first present 'linear' attention, which uses a kernelized formulation of attention to express an autoregressive Transformer as a recurrent neural network, reducing the computational complexity from quadratic to linear in sequence length. We then present 'clustered' attention, which approximates self-attention by clustering the input sequence and using the centroids for computation. We show that clustered attention outperforms vanilla attention for a given computational budget.
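The linear-in-sequence-length complexity comes from replacing the softmax with a positive feature map, so the key-value summary can be computed once and reused for every query. A minimal NumPy sketch of this idea (the `elu(x) + 1` feature map and all function names here are illustrative assumptions, not the thesis's exact implementation):

```python
import numpy as np

def feature_map(x):
    # Positive feature map elu(x) + 1, commonly used for linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Softmax attention costs O(N^2 * d); here the key-value summary KV and
    # the normalizer Z are computed once, giving O(N * d^2) overall.
    Qf, Kf = feature_map(Q), feature_map(K)   # (N, d)
    KV = Kf.T @ V                             # (d, d_v) summary of keys/values
    Z = Kf.sum(axis=0)                        # (d,) normalization statistics
    return (Qf @ KV) / (Qf @ Z)[:, None]      # (N, d_v)
```

Because the implied attention weights are positive and normalized, each output row is a convex combination of the value rows, just as in softmax attention; only the similarity function differs.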
For ASR, we find that linear attention results in word error rate degradation, and that clustering introduces overheads when working with shorter sequences. To address these limitations, we develop a method that stochastically downsamples the input using mean-pooling for efficient wav2vec 2.0 training. This enables using the same model at different compression factors during inference. We conclude that stochastic compression for wav2vec 2.0 pre-training enables building compute-efficient ASR models for languages with limited transcribed data.
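The stochastic compression step can be pictured as picking a random compression factor per batch and mean-pooling non-overlapping windows of feature frames. A hedged NumPy sketch, assuming illustrative factor choices and function names (not the thesis's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_downsample(frames, factors=(1, 2, 4)):
    # frames: (num_frames, feat_dim) feature sequence.
    # Randomly pick a compression factor, then mean-pool each
    # non-overlapping window of that many consecutive frames.
    f = rng.choice(factors)
    n, d = frames.shape
    n_trim = (n // f) * f                     # drop the ragged tail
    pooled = frames[:n_trim].reshape(n_trim // f, f, d).mean(axis=1)
    return pooled, f
```

Sampling the factor during pre-training exposes the model to several effective frame rates, which is what lets a single model run at different compression factors at inference time.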

Keywords:
Transformer, Artificial neural network, Autoregressive model, Cluster analysis, Computational complexity theory, Hidden Markov model, Embedding, Generalization

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 0

Topics

Speech Recognition and Synthesis
Physical Sciences →  Computer Science →  Artificial Intelligence
Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing

Related Documents

JOURNAL ARTICLE

Automatic speech recognition with efficient transformer

Shuhan Luo

Year: 2023 Vol: 1412 Pages: 186
JOURNAL ARTICLE

Untied Positional Encodings for Efficient Transformer-Based Speech Recognition

Lahiru Samarakoon, Ivan W. H. Fung

Journal:   2022 IEEE Spoken Language Technology Workshop (SLT) Year: 2023 Pages: 108-114
JOURNAL ARTICLE

Transformer-Based Turkish Automatic Speech Recognition

D. Emre Taşar, Kutan Koruyan, Cihan Çılgın

Journal: Acta Infologica Year: 2024 Vol: 0 (0) Pages: 0-0
JOURNAL ARTICLE

Speech Emotion Recognition Based on Swin-Transformer

Zirou Liao, Shaoping Shen

Journal: Journal of Physics Conference Series Year: 2023 Vol: 2508 (1) Pages: 012056