An Adapter Based Pre-Training for Efficient and Scalable Self-Supervised Speech Representation Learning

Samuel Kessler; Bethan Thomas; Salah Karout

doi:10.1109/icassp43922.2022.9747374

JOURNAL ARTICLE

Year: 2022
Journal: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Pages: 3179-3183

Abstract

We present a method for transferring pre-trained self-supervised (SSL) speech representations to multiple languages. There is an abundance of unannotated speech, so creating self-supervised representations from raw audio and fine-tuning on small annotated datasets is a promising direction for building speech recognition systems. SSL models are generally pre-trained on raw audio and then fine-tuned on a small fraction of annotated data. Such models have produced state-of-the-art results for ASR. However, these models are very expensive to pre-train. We use an existing wav2vec 2.0 model and tackle the problem of learning new language representations while utilizing existing model knowledge. Crucially, we do so without catastrophic forgetting of the existing language representation. We use adapter modules to speed up pre-training on a new language task. Our model can decrease pre-training times by 32% when learning a new language task, and learns this new audio-language representation without forgetting the previous language representation. We evaluate by applying these language representations to automatic speech recognition.
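
The abstract does not spell out the adapter architecture, so the following is only a minimal sketch of the general idea, assuming a common bottleneck design (down-projection, ReLU, up-projection, residual connection) placed after a frozen pre-trained layer. The names `BottleneckAdapter`, `frozen_layer`, and the dimensions are hypothetical and are not taken from the paper or from wav2vec 2.0.

```python
import math
import random


def matvec(weights, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


class BottleneckAdapter:
    """Hypothetical bottleneck adapter: x + W_up * relu(W_down * x)."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(dim)
        # Down-projection starts with small random weights.
        self.w_down = [[rng.uniform(-scale, scale) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Up-projection starts at zero, so the adapter is initially the
        # identity and the frozen model's behaviour is left unchanged.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.w_down, x)]
        update = matvec(self.w_up, hidden)
        # Residual connection: only a small learned correction is added.
        return [xi + ui for xi, ui in zip(x, update)]

    def num_parameters(self):
        return (sum(len(row) for row in self.w_down)
                + sum(len(row) for row in self.w_up))


def frozen_layer(x):
    """Stand-in for a frozen pre-trained layer (assumption: fixed scaling)."""
    return [2.0 * xi for xi in x]


features = [0.5, -1.0, 0.25, 2.0]
adapter = BottleneckAdapter(dim=4, bottleneck=2)
adapted = adapter(frozen_layer(features))
```

In this sketch only the adapter weights would be trained for a new language, while `frozen_layer` stays fixed; that is the usual way adapters avoid overwriting earlier knowledge. Because the up-projection is initialised to zero, `adapted` equals the frozen layer's output before any training.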

Keywords:

Computer science; Scalability; Adapter (computing); Language model; Artificial intelligence; Natural language processing; Forgetting; Speech recognition; Task (project management); Multi-task learning; Representation (politics); Feature learning; Database

Metrics

FWCI (Field Weighted Citation Impact): 1.76

Citation Normalized Percentile: 0.85

Topics

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Related Documents

A Noise-Robust Self-Supervised Pre-Training Model Based Speech Representation Learning for Automatic Speech Recognition

emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation

Fast-Hubert: an Efficient Training Framework for Self-Supervised Speech Representation Learning

Censer: Curriculum Semi-supervised Learning for Speech Recognition Based on Self-supervised Pre-training

Masked Contrastive Representation Learning for Self-Supervised Visual Pre-Training