JOURNAL ARTICLE

On Local Temporal Embedding for Semi-Supervised Sound Event Detection

Lijian Gao, Qirong Mao, Ming Dong

Year: 2024
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Vol: 32
Pages: 1687-1698
Publisher: Institute of Electrical and Electronics Engineers

Abstract

The semi-supervised sound event detection (SSED) task requires recognizing the categories of events and marking each event's onset and offset times in a mixed audio recording, using a small amount of weakly labeled data and a large amount of unlabeled data. Exploring local temporal information, i.e., local discrimination and local correlations in the time domain, is therefore essential for SSED and, in particular, for precise event boundary detection. In addition, because manually labeled datasets are scarce, SSED tasks must exploit unlabeled data effectively to reduce overfitting, typically through regularization techniques. Recently, self-supervised learning has provided a viable way to leverage unlabeled data for effective feature learning in various downstream tasks. In this paper, we propose LTE-Net, a novel multitask framework that learns the Local Temporal Embedding for SSED. Specifically, LTE-Net first locally down-samples the input spectrogram and learns token embeddings with a high temporal resolution (i.e., local discrimination). Then, LTE-Net models the local correlations among the token embeddings through self-supervised masked spectrogram modeling. Finally, a novel joint (self- and semi-supervised) regularization framework is employed to train LTE-Net so that it effectively leverages unlabeled data in SSED. Extensive experiments on the DCASE 2019, 2020 and 2021 SSED datasets show that LTE-Net significantly outperforms existing methods, achieving performance gains of 2.1% to 8.7%, 2.1% to 3.9% and 1.2% to 6.1% on the evaluation sets of the 2019, 2020 and 2021 datasets, respectively.
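
The masked spectrogram modeling step mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' LTE-Net implementation: the fixed reshape-based tokenization (the paper learns its token embeddings), the zero-fill masking, the mean-squared-error objective, the 40% mask ratio, the random placeholder spectrogram, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def tokenize(spec, frames_per_token):
    """Group consecutive frames into tokens: (T, F) -> (N, frames_per_token * F).

    Assumption: a fixed reshape stands in for learned local down-sampling.
    """
    n_frames, n_bins = spec.shape
    n_tokens = n_frames // frames_per_token  # drop any trailing partial token
    trimmed = spec[: n_tokens * frames_per_token]
    return trimmed.reshape(n_tokens, frames_per_token * n_bins)


def random_token_mask(n_tokens, mask_ratio, rng):
    """Boolean mask with round(mask_ratio * n_tokens) randomly chosen True entries."""
    n_masked = int(round(mask_ratio * n_tokens))
    mask = np.zeros(n_tokens, dtype=bool)
    mask[rng.choice(n_tokens, size=n_masked, replace=False)] = True
    return mask


def masked_reconstruction_loss(tokens, predictions, mask):
    """Mean squared error computed over the masked tokens only."""
    if not mask.any():
        return 0.0
    diff = predictions[mask] - tokens[mask]
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)

# Placeholder input (assumption): 100 time frames x 64 mel bins of random values.
spec = rng.standard_normal((100, 64))

tokens = tokenize(spec, frames_per_token=4)  # shape (25, 256)
mask = random_token_mask(len(tokens), mask_ratio=0.4, rng=rng)

# Hide the masked tokens from the model by zeroing them out.
corrupted = np.where(mask[:, None], 0.0, tokens)

# Stand-in "model" (assumption): echoes its corrupted input unchanged.
# A real model would predict the hidden tokens from their visible neighbours.
predictions = corrupted

loss = masked_reconstruction_loss(tokens, predictions, mask)
print(tokens.shape, int(mask.sum()), loss)
```

Restricting the loss to masked positions means a model can only reduce it by inferring hidden tokens from the visible ones around them, which is one way to encourage learning of local correlations along the time axis.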

Keywords:
Sound (geography); Event (particle physics); Embedding; Computer science; Artificial intelligence; Speech recognition; Acoustics; Physics

Metrics

Cited By: 14
FWCI (Field Weighted Citation Impact): 9.98
Refs: 47
Citation Normalized Percentile: 0.97

Topics

Music and Audio Processing (Physical Sciences → Computer Science → Signal Processing)
Speech and Audio Processing (Physical Sciences → Computer Science → Signal Processing)
Music Technology and Sound Studies (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)