From Coarse to Fine: Efficient Training for Audio Spectrogram Transformers

Jiu Feng; Mehmet Hamza Erol; Joon Son Chung; Arda Senocak

doi:10.1109/icassp48485.2024.10448376

JOURNAL ARTICLE

Year: 2024; Pages: 1416-1420

Abstract

Transformers have become central to recent advances in audio classification. However, training an audio spectrogram transformer, e.g. AST, from scratch can be resource- and time-intensive. Furthermore, the complexity of transformers depends heavily on the size of the input audio spectrogram. In this work, we aim to optimize AST training by linking it to the resolution along the time axis. We introduce multi-phase training of audio spectrogram transformers by connecting the seminal idea of coarse-to-fine with transformer models. To achieve this, we propose a set of methods for temporal compression. By employing one of these methods, the transformer model learns from lower-resolution (coarse) data in the initial phases and is then fine-tuned with high-resolution data in later phases, following a curriculum learning strategy. Experimental results demonstrate that the proposed training mechanism for AST leads to improved (or on-par) performance with faster convergence, i.e. requiring fewer computational resources and less time. This approach is also generalizable to other AST-based methods regardless of their learning paradigms.
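The coarse-to-fine idea in the abstract can be illustrated with a minimal sketch. The abstract does not name the temporal compression methods, so average pooling along the time axis is used here purely as an assumed stand-in. The function names, the epoch boundaries, and the compression factors are all hypothetical, and the cost estimate assumes that the token count scales linearly with the number of time frames and that self-attention cost grows quadratically with the token count.

```python
from typing import List, Tuple

# A spectrogram as nested lists: spec[time_frame][freq_bin].
Spectrogram = List[List[float]]

# Hypothetical multi-phase schedule: (first_epoch, time_compression_factor).
# Coarse phases come first; factor 1 means full temporal resolution.
SCHEDULE: List[Tuple[int, int]] = [(0, 4), (10, 2), (20, 1)]


def compress_time(spec: Spectrogram, factor: int) -> Spectrogram:
    """Average-pool consecutive time frames (assumed compression method)."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    pooled: Spectrogram = []
    for start in range(0, len(spec), factor):
        window = spec[start:start + factor]  # last window may be shorter
        n_bins = len(window[0])
        pooled.append(
            [sum(frame[b] for frame in window) / len(window) for b in range(n_bins)]
        )
    return pooled


def phase_factor(epoch: int, schedule: List[Tuple[int, int]]) -> int:
    """Return the compression factor of the phase that contains `epoch`."""
    current = schedule[0][1]
    for first_epoch, factor in schedule:
        if epoch >= first_epoch:
            current = factor
    return current


def relative_attention_cost(factor: int) -> float:
    """Pairwise attention cost relative to full resolution, under the
    assumption that tokens scale as 1/factor and cost as tokens squared."""
    return 1.0 / (factor * factor)
```

In a training loop, each batch would be passed through `compress_time(spec, phase_factor(epoch, SCHEDULE))` before being fed to the model. Under the stated assumptions, a factor of 4 in the first phase reduces the pairwise attention cost to roughly 1/16 of the full-resolution cost.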

Keywords:

Spectrogram; Computer science; Transformer; Speech recognition; Training set; Artificial intelligence; Voltage; Engineering

Metrics

FWCI (Field Weighted Citation Impact): 1.43

Citation Normalized Percentile: 0.68

Topics

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Efficient Fine-tuning of Audio Spectrogram Transformers via Soft Mixture of Adapters

Parameter-Efficient Transfer Learning of Audio Spectrogram Transformers

Adapter Incremental Continual Learning of Efficient Audio Spectrogram Transformers

Spectrogram Transformers for Audio Classification

MAST: Multiscale Audio Spectrogram Transformers