DISSERTATION

Efficient Transformer-Based Speech Recognition

Vyas, Apoorv

Year: 2022 University: École Polytechnique Fédérale de Lausanne (via the Infoscience repository)

Abstract

Training deep neural network-based Automatic Speech Recognition (ASR) models often requires thousands of hours of transcribed data, limiting their use to only a few languages. Moreover, current state-of-the-art acoustic models are based on the Transformer architecture, which scales quadratically with sequence length, hindering its use for long sequences. This thesis aims to reduce (a) the data and (b) the compute requirements for developing state-of-the-art ASR systems with only a few hundred hours of transcribed data or less. The first part of this thesis focuses on reducing the amount of transcribed data required to train these models. We propose an approach that uses dropout for uncertainty-aware semi-supervised learning. We show that our approach generates better hypotheses for training with unlabelled data. We then investigate the out-of-domain and cross-lingual generalization of two popular self-supervised pre-training approaches: Masked Acoustic Modeling and wav2vec 2.0. We conclude that both pre-training approaches generalize to unseen domains and significantly outperform models trained only with supervised data. In the second part, we focus on reducing the computational requirements of the Transformer model, (a) by devising efficient forms of attention computation and (b) by reducing the input context length for attention computation. We first present 'linear' attention, which uses a kernelized formulation of attention to express an autoregressive Transformer as a recurrent neural network, reducing the computational complexity from quadratic to linear in sequence length. We then present 'clustered' attention, which approximates self-attention by clustering the input sequence and using the centroids for computation. We show that clustered attention outperforms vanilla attention for a given computational budget.
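The linear-in-sequence-length complexity comes from replacing the softmax with a positive feature map, so the key-value summary can be computed once and reused for every query. A minimal NumPy sketch of this idea (the `elu(x) + 1` feature map and all function names here are illustrative assumptions, not the thesis's exact implementation):

```python
import numpy as np

def feature_map(x):
    # Positive feature map elu(x) + 1, commonly used for linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Softmax attention costs O(N^2 * d); here the key-value summary KV and
    # the normalizer Z are computed once, giving O(N * d^2) overall.
    Qf, Kf = feature_map(Q), feature_map(K)   # (N, d)
    KV = Kf.T @ V                             # (d, d_v) summary of keys/values
    Z = Kf.sum(axis=0)                        # (d,) normalization statistics
    return (Qf @ KV) / (Qf @ Z)[:, None]      # (N, d_v)
```

Because the implied attention weights are positive and normalized, each output row is a convex combination of the value rows, just as in softmax attention; only the similarity function differs.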
For ASR, we find that linear attention results in word error rate degradation, and that clustering introduces overheads when working with shorter sequences. To address these limitations, we develop a method that stochastically downsamples the input using mean-pooling for efficient wav2vec 2.0 training. This enables using the same model at different compression factors during inference. We conclude that stochastic compression for wav2vec 2.0 pre-training enables building compute-efficient ASR models for languages with limited transcribed data.
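The stochastic compression step can be pictured as picking a random compression factor per batch and mean-pooling non-overlapping windows of feature frames. A hedged NumPy sketch, assuming illustrative factor choices and function names (not the thesis's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_downsample(frames, factors=(1, 2, 4)):
    # frames: (num_frames, feat_dim) feature sequence.
    # Randomly pick a compression factor, then mean-pool each
    # non-overlapping window of that many consecutive frames.
    f = rng.choice(factors)
    n, d = frames.shape
    n_trim = (n // f) * f                     # drop the ragged tail
    pooled = frames[:n_trim].reshape(n_trim // f, f, d).mean(axis=1)
    return pooled, f
```

Sampling the factor during pre-training exposes the model to several effective frame rates, which is what lets a single model run at different compression factors at inference time.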

Keywords:
Transformer, Artificial neural network, Autoregressive model, Cluster analysis, Computational complexity theory, Hidden Markov model, Embedding, Generalization

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 0

Topics

Speech Recognition and Synthesis
Physical Sciences →  Computer Science →  Artificial Intelligence
Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing

Related Documents

JOURNAL ARTICLE

Automatic speech recognition with efficient transformer

Shuhan Luo

Year: 2023 Vol: 1412 Pages: 186
JOURNAL ARTICLE

Untied Positional Encodings for Efficient Transformer-Based Speech Recognition

Lahiru Samarakoon, Ivan W. H. Fung

Journal:   2022 IEEE Spoken Language Technology Workshop (SLT) Year: 2023 Pages: 108-114
JOURNAL ARTICLE

Transformer-Based Turkish Automatic Speech Recognition

D. Emre Taşar, Kutan Koruyan, Cihan Çılgın

Journal: Acta Infologica Year: 2024 Vol: 0 (0) Pages: 0-0
JOURNAL ARTICLE

Speech Emotion Recognition Based on Swin-Transformer

Zirou Liao, Shaoping Shen

Journal: Journal of Physics Conference Series Year: 2023 Vol: 2508 (1) Pages: 012056