Self-Supervised Audio-Visual Speech Representations Learning by Multimodal Self-Distillation

Jingxuan Zhang; Genshun Wan; Zhen-Hua Ling; Jia Pan; Jianqing Gao; Cong Liu

doi:10.1109/icassp49357.2023.10095029

JOURNAL ARTICLE

Year: 2023 Pages: 1-5

Abstract

In this work, we present a novel method, named AV2vec, for learning audio-visual speech representations by multimodal self-distillation. AV2vec has a student module and a teacher module: the student performs a masked latent feature regression task using multimodal target features that the teacher generates online, and the teacher's parameters are a momentum update of the student's. Because the target features are generated online, AV2vec does not need the iterative training step that AV-HuBERT requires, and the total training time cost is reduced to less than one-fifth. We further propose AV2vec-MLM, which augments AV2vec with a masked language model (MLM)-style loss through multitask learning. Our experimental results show that AV2vec achieves performance comparable to the AV-HuBERT baseline. When combined with the MLM-style loss, AV2vec-MLM outperforms the baselines and achieves the best performance on the downstream tasks.
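The student-teacher mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the flat parameter lists, the decay value, and the choice of mean squared error as the regression loss are all assumptions made for illustration, since the abstract specifies only a "momentum update" and a "masked latent feature regression task".

```python
from typing import List, Sequence

Vector = List[float]


def ema_update(teacher: Sequence[float], student: Sequence[float], decay: float) -> Vector:
    """Momentum (exponential moving average) update of the teacher parameters.

    new_teacher = decay * teacher + (1 - decay) * student
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of parameters")
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def masked_regression_loss(
    pred: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    mask: Sequence[bool],
) -> float:
    """Mean squared error between student predictions and teacher targets,
    averaged over masked frames only.

    MSE is an assumed choice; the abstract does not name the loss.
    """
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if not m:
            continue  # unmasked frames do not contribute to the loss
        total += sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / len(p)
        count += 1
    return total / count if count else 0.0


# Hypothetical toy values: two parameters, two frames of 2-dim features.
teacher_params = [0.0, 0.0]
student_params = [1.0, -1.0]
teacher_params = ema_update(teacher_params, student_params, decay=0.999)

loss = masked_regression_loss(
    pred=[[1.0, 1.0], [0.0, 0.0]],
    target=[[0.0, 0.0], [5.0, 5.0]],
    mask=[True, False],  # only the first frame is masked
)
```

In a training loop of this kind, the student would be updated by gradient descent on the masked loss, and the teacher would then be refreshed with `ema_update` after each step. Because the targets come from the current teacher, no separate offline target-generation pass is needed.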

Keywords:

Computer science; Task (project management); Artificial intelligence; Speech recognition; Distillation; Feature (linguistics); Multimodal learning; Machine learning; Baseline (sea); Multi-task learning; Natural language processing

Metrics

FWCI (Field Weighted Citation Impact): 1.61

Citation Normalized Percentile: 0.80

Topics

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Audio-guided self-supervised learning for disentangled visual speech representations

Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition

AV-data2vec: Self-Supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations

Learning Self-supervised Audio-Visual Representations for Sound Recommendations

ES³: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations
