JOURNAL ARTICLE

Multi-Modal Pre-Training for Automated Speech Recognition

David M. Chan, Shalini Ghosh, Debmalya Chakrabarty, Björn Hoffmeister

Year: 2022 Journal: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Vol: 2 Pages: 246-250

Abstract

Traditionally, research in automated speech recognition has focused on local-first encoding of audio representations to predict the spoken phonemes in an utterance. Unfortunately, approaches relying on such hyper-local information tend to be vulnerable to both local-level corruption (such as audio-frame drops, or loud noises) and global-level noise (such as environmental noise, or background noise) that has not been seen during training. In this work, we introduce a novel approach that leverages a self-supervised learning technique based on masked language modeling to compute a global, multi-modal encoding of the environment in which the utterance occurs. We then use a new deep-fusion framework to integrate this global context into a traditional ASR method, and demonstrate that the resulting method can outperform baseline methods by up to 7% on Librispeech; gains on internal datasets range from 6% (on larger models) to 45% (on smaller models).

Keywords:
Computer science, Utterance, Speech recognition, Noise (video), Encoding (memory), Context (archaeology), Artificial intelligence, Acoustic model, Modal, Baseline (sea), Language model, Natural language processing, Speech processing, Image (mathematics)

Metrics

Cited By: 13
FWCI (Field Weighted Citation Impact): 1.53
Refs: 43
Citation Normalized Percentile: 0.83

Topics

Speech Recognition and Synthesis
Physical Sciences →  Computer Science →  Artificial Intelligence
Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing

Related Documents

JOURNAL ARTICLE

Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition

Bandi Dixitha

Journal: INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT Year: 2025 Vol: 09 (06) Pages: 1-9

JOURNAL ARTICLE

Automated training for speech recognition

L. A. Smith, Brian L. Scott, L. S. Lin, J. M. Newell

Journal: The Journal of the Acoustical Society of America Year: 1989 Vol: 86 (S1) Pages: S78-S78

BOOK-CHAPTER

Multi-task Pre-training for Lhasa-Tibetan Speech Recognition

Y. Liu, Yue Zhao, Xiaona Xu, Liang Xu, Xubei Zhang

Lecture notes in computer science Year: 2023 Pages: 78-90