Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition

Xichen Pan; Pei-yu Chen; Yichen Gong; Helong Zhou; Xinbing Wang; Zhouhan Lin

doi:10.18653/v1/2022.acl-long.308

JOURNAL ARTICLE

Year: 2022

Journal: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Abstract

Training Transformer-based models demands a large amount of data, while obtaining aligned and labelled multimodal data is costly, especially for audio-visual speech recognition (AVSR). It therefore makes sense to exploit unlabelled unimodal data. On the other hand, although the effectiveness of large-scale self-supervised learning is well established in both the audio and visual modalities, how to integrate those pre-trained models into a multimodal scenario remains underexplored. In this work, we leverage unimodal self-supervised learning to improve multimodal AVSR. In particular, audio and visual front-ends are trained on large-scale unimodal datasets; we then integrate components of both front-ends into a larger multimodal framework that learns to transcribe parallel audio-visual data into characters through a combination of CTC and seq2seq decoding. We show that the components inherited from unimodal self-supervised learning cooperate well, so that the multimodal framework yields competitive results after fine-tuning. Our model is experimentally validated on both word-level and sentence-level tasks. Notably, even without an external language model, our proposed model improves on the state of the art on the widely used Lip Reading Sentences 2 (LRS2) dataset by a large margin, with a relative improvement of 30%.
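
The abstract mentions a combination of CTC and seq2seq decoding but does not say how the two are combined. As a minimal illustration only, the sketch below assumes a common approach: linearly interpolating the two log-probabilities of each candidate transcript with a weight. The names `joint_score`, `pick_best`, and `lam`, and all numeric scores, are hypothetical and are not taken from the paper.

```python
import math


def joint_score(ctc_logp: float, s2s_logp: float, lam: float = 0.5) -> float:
    """Interpolate CTC and seq2seq log-probabilities for one candidate.

    lam is a hypothetical weight: 1.0 uses CTC only, 0.0 uses seq2seq only.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * ctc_logp + (1.0 - lam) * s2s_logp


def pick_best(candidates: dict, lam: float = 0.5) -> str:
    """Return the transcript with the highest interpolated score.

    candidates maps each transcript to a (ctc_logp, s2s_logp) pair.
    """
    return max(candidates, key=lambda text: joint_score(*candidates[text], lam))


# Made-up scores for two candidate transcripts (illustration only).
candidates = {
    "hello world": (math.log(0.30), math.log(0.50)),
    "hollow world": (math.log(0.40), math.log(0.20)),
}

best = pick_best(candidates, lam=0.5)
```

With these made-up scores, the CTC branch alone would prefer the second candidate, while the interpolated score prefers the first, which shows how one branch can correct the other.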

Keywords:

Computer science; Leverage (statistics); Speech recognition; Artificial intelligence; Multimodal learning; Transformer; Margin (machine learning); Multimodality; Supervised learning; Sentence; Natural language processing; Machine learning

Metrics

FWCI (Field Weighted Citation Impact): 5.89

Citation Normalized Percentile: 0.97

Topics

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Related Documents

Self-Supervised Audio-Visual Speech Representations Learning by Multimodal Self-Distillation

Robust Self-Supervised Audio-Visual Speech Recognition

Deep multimodal learning for Audio-Visual Speech Recognition

Multimodal learning using 3D audio-visual data for audio-visual speech recognition

Jointly Learning From Unimodal and Multimodal-Rated Labels in Audio-Visual Emotion Recognition