Multiscale Audio Spectrogram Transformer for Efficient Audio Classification

Wentao Zhu; Mohamed Omar

doi:10.1109/icassp49357.2023.10096513

JOURNAL ARTICLE

Year: 2023 Pages: 1-5

Abstract

Audio events have a hierarchical structure in both time and frequency, and can be grouped together to construct more abstract semantic audio classes. In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs hierarchical representation learning for efficient audio classification. Specifically, MAST applies one-dimensional (and two-dimensional) pooling operators along the time (and frequency) domains in different stages, progressively reducing the number of tokens while increasing the feature dimensions. Without external training data, MAST significantly outperforms AST [1] by 22.2%, 4.4% and 4.7% in top-1 accuracy on Kinetics-Sounds, Epic-Kitchens-100 and VGGSound, respectively. On the downloaded AudioSet dataset, in which over 20% of the audio clips are missing, MAST also achieves slightly better accuracy than AST. In addition, MAST is 5× more efficient than AST in terms of multiply-accumulates (MACs) and has 42% fewer parameters. Through clustering metrics and visualizations, we demonstrate that MAST learns semantically more separable feature representations from audio signals.
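The stage-wise pooling described in the abstract can be illustrated with a small bookkeeping sketch. It only tracks how the token grid and the feature dimension might evolve across stages; it is not the authors' implementation. The function name `pooled_schedule`, the starting grid (64 time tokens by 8 frequency tokens), the initial dimension of 96, and the per-stage strides and multipliers are all hypothetical values chosen for illustration, since the abstract does not state them.

```python
from typing import List, Tuple

Stage = Tuple[int, int, int]     # (time stride, frequency stride, dim multiplier)
Row = Tuple[int, int, int, int]  # (time tokens, freq tokens, total tokens, dim)


def pooled_schedule(time_tokens: int, freq_tokens: int, dim: int,
                    stages: List[Stage]) -> List[Row]:
    """Track token-grid size and feature width after each pooling stage.

    A frequency stride of 1 models pooling along time only (1-D pooling);
    strides greater than 1 on both axes model 2-D pooling.
    """
    rows = [(time_tokens, freq_tokens, time_tokens * freq_tokens, dim)]
    for t_stride, f_stride, dim_mult in stages:
        # Ceiling division, so a partial pooling window still yields a token.
        time_tokens = -(-time_tokens // t_stride)
        freq_tokens = -(-freq_tokens // f_stride)
        dim *= dim_mult
        rows.append((time_tokens, freq_tokens, time_tokens * freq_tokens, dim))
    return rows


# Hypothetical configuration (assumed, not taken from the paper):
# one time-only pooling stage followed by two time-frequency pooling stages.
schedule = pooled_schedule(64, 8, 96, [(2, 1, 2), (2, 2, 2), (2, 2, 2)])
for t, f, n, d in schedule:
    print(f"time={t:3d} freq={f:2d} tokens={n:4d} dim={d:4d}")
```

Under these assumed settings the token count falls from 512 to 16 while the feature dimension grows from 96 to 768. Because self-attention cost grows with the square of the token count, this kind of schedule is one plausible source of the MAC savings the abstract reports.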

Keywords:

Computer science, Spectrogram, Pooling, Pattern recognition (psychology), Cluster analysis, Feature extraction, Artificial intelligence, Speech recognition, Feature learning

Metrics

FWCI (Field Weighted Citation Impact): 6.98

Citation Normalized Percentile: 0.97

Topics

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Music Technology and Sound Studies

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Cough Classification Using Audio Spectrogram Transformer

LungAdapter: Efficient Adapting Audio Spectrogram Transformer for Lung Sound Classification

AST: Audio Spectrogram Transformer

MAST: Multiscale Audio Spectrogram Transformers

Audio Spectrogram Transformer-based Audio Classification using Voice data of Dementia Patients