On Local Temporal Embedding for Semi-Supervised Sound Event Detection

Lijian Gao; Qirong Mao; Ming Dong

doi:10.1109/taslp.2024.3369529

JOURNAL ARTICLE

Year: 2024
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Vol: 32
Pages: 1687-1698
Publisher: Institute of Electrical and Electronics Engineers

Abstract

The semi-supervised sound event detection (SSED) task requires recognizing event categories and marking each event's onset and offset times in a mixed audio recording, using a small amount of weakly labeled data and a large amount of unlabeled data. Exploring local temporal information, i.e., local discrimination and local correlations in the time domain, is therefore essential for SSED, particularly for precise event boundary detection. Moreover, because manually labeled datasets are scarce, SSED requires effectively exploiting unlabeled data to reduce overfitting, typically through regularization techniques. Recently, self-supervised learning has provided a viable way to leverage unlabeled data for effective feature learning in various downstream tasks. In this paper, we propose LTE-Net, a novel multitask framework, to learn the Local Temporal Embedding for SSED. Specifically, LTE-Net first locally down-samples the input spectrogram and learns token embeddings with a high temporal resolution (i.e., local discrimination). Then, LTE-Net models the local correlations among the token embeddings through self-supervised masked spectrogram modeling. Finally, a novel joint (self- and semi-supervision) regularization framework is employed to train LTE-Net so that it effectively leverages unlabeled data in SSED. Extensive experiments on the DCASE 2019, 2020 and 2021 SSED datasets show that LTE-Net significantly outperforms existing methods, achieving performance gains of 2.1% to 8.7%, 2.1% to 3.9% and 1.2% to 6.1% on the evaluation sets of the 2019, 2020 and 2021 datasets, respectively.
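
The masked spectrogram modeling step mentioned in the abstract can be illustrated with a small, generic sketch. This is not the LTE-Net implementation: the abstract does not specify the token size, the masking strategy, or the loss, so the zero-fill masking, the mean-squared-error loss, and the function names `patchify`, `mask_tokens`, and `masked_mse` below are all assumptions chosen for illustration.

```python
import random

def patchify(spec, frames_per_token):
    """Group consecutive time frames into tokens (flattened patches)."""
    n_tokens = len(spec) // frames_per_token  # drop any ragged tail
    tokens = []
    for i in range(n_tokens):
        chunk = spec[i * frames_per_token:(i + 1) * frames_per_token]
        tokens.append([v for frame in chunk for v in frame])
    return tokens

def mask_tokens(tokens, mask_ratio, rng):
    """Zero out a random subset of tokens (assumed masking scheme).

    Returns the masked copy and the sorted indices of the hidden tokens.
    """
    n_masked = max(1, round(mask_ratio * len(tokens)))
    masked_idx = sorted(rng.sample(range(len(tokens)), n_masked))
    hidden = set(masked_idx)
    masked = [[0.0] * len(t) if i in hidden else list(t)
              for i, t in enumerate(tokens)]
    return masked, masked_idx

def masked_mse(pred, target, masked_idx):
    """Mean squared reconstruction error over the masked tokens only."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count

# Toy input: 8 time frames x 4 frequency bins (values are arbitrary).
spec = [[float(t + f) for f in range(4)] for t in range(8)]
tokens = patchify(spec, frames_per_token=2)  # 4 tokens of 8 values each
masked, masked_idx = mask_tokens(tokens, mask_ratio=0.5, rng=random.Random(0))

# A real model would predict the hidden tokens from their neighbours;
# here the zero-filled input stands in for a (poor) prediction.
loss = masked_mse(masked, tokens, masked_idx)
```

A smaller `frames_per_token` keeps more tokens per second of audio, which is the sense in which a token sequence retains high temporal resolution. Because this objective needs no labels, it can be computed on unlabeled recordings.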

Keywords:

Sound (geography); Event (particle physics); Embedding; Computer science; Artificial intelligence; Speech recognition; Acoustics; Physics

Metrics

FWCI (Field Weighted Citation Impact): 9.98

Citation Normalized Percentile: 0.97

Topics

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Music Technology and Sound Studies

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Semi-Supervised Sound Event Detection with Local and Global Consistency Regularization

Debiased Training For Semi-supervised Sound Event Detection

Semi-Supervised NMF-CNN for Sound Event Detection

Couple learning for semi-supervised sound event detection

Semi-supervised Local Discriminant Embedding