Abstract

We propose an end-to-end, regression-based, semi-supervised approach to sound event detection (SED). As a segment-level method, it treats predicting the temporal location of acoustic events as a regression problem, which alleviates issues with classification-based approaches that rely on frame-level classification followed by a smoothing operation. Under this formulation, we design machine-learning systems that learn the relation between onset, duration, and acoustic features. To leverage unlabeled data in model training, we propose a simple and effective training process based on a mean-teacher framework. We implemented two regression-based SED systems, with CRNN and VGGSK models. On the validation set of the Domestic Environment Sound Event Detection Dataset, the two systems achieve event-based F1 scores of 0.480 and 0.523 and intersection-based F1 scores of 0.673 and 0.728, respectively. Both outperform classification-based systems using the same backbone networks, which achieve event-based F1 scores of 0.429 and 0.463 and intersection-based F1 scores of 0.643 and 0.704.
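
The abstract does not spell out the training objective, so the following is only a minimal sketch of how a mean-teacher setup is commonly combined with a regression target. It assumes each event is encoded as an (onset, duration) pair in seconds, that the supervised and consistency terms are both mean squared errors, and that the teacher weights are an exponential moving average of the student weights. The names `Segment`, `mse`, `mean_teacher_loss`, `ema_update`, `consistency_weight`, and `alpha` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical target encoding: (onset_seconds, duration_seconds).
Segment = Tuple[float, float]


def mse(pred: List[Segment], target: List[Segment]) -> float:
    """Mean squared error over paired (onset, duration) values."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    if not pred:
        return 0.0
    total = 0.0
    for (p_on, p_dur), (t_on, t_dur) in zip(pred, target):
        total += (p_on - t_on) ** 2 + (p_dur - t_dur) ** 2
    return total / (2 * len(pred))


def mean_teacher_loss(
    student_labeled: List[Segment],
    targets: List[Segment],
    student_unlabeled: List[Segment],
    teacher_unlabeled: List[Segment],
    consistency_weight: float = 1.0,
) -> float:
    """Supervised regression loss plus a weighted student/teacher consistency term."""
    supervised = mse(student_labeled, targets)
    # Unlabeled clips have no targets; the teacher's predictions stand in for them.
    consistency = mse(student_unlabeled, teacher_unlabeled)
    return supervised + consistency_weight * consistency


def ema_update(
    teacher_weights: List[float],
    student_weights: List[float],
    alpha: float = 0.99,
) -> List[float]:
    """Return teacher weights moved toward the student by an exponential moving average."""
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_weights, student_weights)]
```

In this sketch the unlabeled data contributes only through the consistency term, which is one common way a mean-teacher framework can use clips that have no onset or duration annotations.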

Keywords:
Leverage (statistics); Computer science; Smoothing; Artificial intelligence; Regression; Machine learning; Event (particle physics); Regression analysis; Intersection (aeronautics); Pattern recognition (psychology); Statistics; Mathematics; Computer vision; Engineering
