JOURNAL ARTICLE

Transformer-Based Bioacoustic Sound Event Detection on Few-Shot Learning Tasks

Abstract

Automatic detection of bioacoustic sound events is crucial for wildlife monitoring. Because annotation is tedious, labeled events are scarce, and recording volumes are large, few-shot learning (FSL), which detects events from only a handful of labeled examples, is well suited to this task. Typical FSL frameworks for sound event detection use Convolutional Neural Networks (CNNs) to extract features; however, CNNs fail to capture long-range relationships and global context in audio data. We present an approach that combines the Audio Spectrogram Transformer (AST), a data augmentation regime, and transductive inference to detect sound events in the DCASE2022 (Task 5) dataset. Our results show that the AST model outperforms a CNN-based model on all recordings. With transductive inference on FSL tasks, our approach yields a 6% improvement over the baseline AST feature-extraction pipeline. The approach generalizes well across sound events from different animal species, recordings, and durations, suggesting its effectiveness for FSL tasks.
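The abstract does not give implementation details, but the core idea of few-shot classification with transductive inference can be sketched as follows. In this illustrative example (not the paper's exact method), class prototypes are computed from a few labeled support embeddings, then refined transductively by soft-assigning the unlabeled query embeddings to the prototypes. The small 2-D vectors stand in for AST feature embeddings; all function names here are hypothetical.

```python
import numpy as np

def transductive_refine(support, labels, queries, n_classes, n_iters=10):
    """Prototype-based few-shot classification with transductive refinement.

    support: (S, D) labeled support embeddings; labels: (S,) class ids;
    queries: (Q, D) unlabeled query embeddings used transductively.
    """
    counts = np.array([(labels == c).sum() for c in range(n_classes)], dtype=float)
    sums = np.stack([support[labels == c].sum(axis=0) for c in range(n_classes)])
    protos = sums / counts[:, None]  # initial per-class mean prototypes
    for _ in range(n_iters):
        # squared Euclidean distance from each query to each prototype: (Q, C)
        d = ((queries[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
        w = np.exp(-d)
        w /= w.sum(axis=1, keepdims=True)  # soft assignment of queries to classes
        # update prototypes with support points plus soft-labeled queries
        protos = (sums + w.T @ queries) / (counts + w.sum(axis=0))[:, None]
    return protos

def predict(protos, queries):
    """Assign each query to its nearest refined prototype."""
    d = ((queries[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)
```

Using the unlabeled queries to refine the prototypes is what makes the inference transductive: the decision boundary adapts to the query set itself rather than relying only on the few support examples.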

Keywords:
Spectrogram, Computer science, Bioacoustics, Inference, Artificial intelligence, Feature extraction, Convolutional neural network, Speech recognition, Pattern recognition, Machine learning

Metrics

Cited by: 14
FWCI (Field-Weighted Citation Impact): 3.76
References: 13
Citation Normalized Percentile: 0.92 (in top 10%)

Topics

Music and Audio Processing (Physical Sciences → Computer Science → Signal Processing)
Animal Vocal Communication and Behavior (Life Sciences → Biochemistry, Genetics and Molecular Biology → Developmental Biology)
Speech and Audio Processing (Physical Sciences → Computer Science → Signal Processing)