Transformer-based models attain excellent results and generalize well when trained on sufficient amounts of data. However, constrained by the limited data available in the audio domain, most transformer-based audio models are fine-tuned from models pre-trained in other domains (e.g., images), which have a notable domain gap from audio. Other methods explore self-supervised learning directly in the audio domain but currently do not perform well on downstream tasks. In this paper, we present a novel self-supervised learning method for transformer-based audio models, called masked spectrogram prediction (MaskSpec), which learns powerful audio representations from unlabeled audio data (AudioSet in this paper). Our method masks random patches of the input spectrogram and reconstructs the masked regions with an encoder-decoder architecture. Experimental results demonstrate that MaskSpec achieves 0.471 mAP on AudioSet, 0.854 mAP on OpenMIC-2018, 0.982 accuracy on ESC-50, 0.976 accuracy on SCV2, and 0.823 accuracy on DCASE2019 Task 1A. The source code and pre-trained models have been released.
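To make the mask-and-reconstruct idea concrete, below is a minimal PyTorch sketch of random patch masking on a log-mel spectrogram with an encoder-decoder, in the style of masked autoencoding. The class name, layer sizes, mask ratio, and spectrogram shape are illustrative assumptions, not the released MaskSpec implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: names and hyperparameters are assumptions for
# illustration, not the authors' reference code.
class MaskedSpectrogramAutoencoder(nn.Module):
    def __init__(self, patch_size=16, embed_dim=192, mask_ratio=0.75,
                 n_mels=128, n_frames=1024):
        super().__init__()
        self.mask_ratio = mask_ratio
        num_patches = (n_mels // patch_size) * (n_frames // patch_size)
        # Split the spectrogram into non-overlapping patches and embed them.
        self.patch_embed = nn.Conv2d(1, embed_dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        enc = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)
        dec = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec, num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Predict the raw values of each patch (patch_size * patch_size).
        self.head = nn.Linear(embed_dim, patch_size * patch_size)

    def random_mask(self, tokens):
        B, N, D = tokens.shape
        n_keep = int(N * (1 - self.mask_ratio))
        noise = torch.rand(B, N, device=tokens.device)
        ids_shuffle = noise.argsort(dim=1)         # random patch permutation
        ids_restore = ids_shuffle.argsort(dim=1)   # inverse permutation
        ids_keep = ids_shuffle[:, :n_keep]
        visible = torch.gather(
            tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
        mask = torch.ones(B, N, device=tokens.device)
        mask.scatter_(1, ids_keep, 0.0)            # 1 marks a masked patch
        return visible, mask, ids_restore

    def forward(self, spec):
        # spec: (B, 1, n_mels, n_frames) log-mel spectrogram
        tokens = self.patch_embed(spec).flatten(2).transpose(1, 2)
        tokens = tokens + self.pos_embed
        visible, mask, ids_restore = self.random_mask(tokens)
        latent = self.encoder(visible)             # encode visible patches only
        # Append mask tokens and restore the original patch order.
        B, N = mask.shape
        mask_tokens = self.mask_token.expand(B, N - latent.shape[1], -1)
        full = torch.cat([latent, mask_tokens], dim=1)
        full = torch.gather(
            full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, full.shape[-1]))
        pred = self.head(self.decoder(full + self.pos_embed))
        return pred, mask  # reconstruction loss is taken on masked patches only
```

A typical training step would compute a mean-squared error between `pred` and the ground-truth patch values, weighted by `mask` so only the masked regions contribute to the loss.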