In this paper, we propose an end-to-end, regression-based, semi-supervised approach to sound event detection (SED). As a segment-level method, the proposed approach treats the prediction of the temporal location of acoustic events as a regression problem, which alleviates the issues of classification-based approaches that rely on frame-level classification followed by a smoothing operation. Treating SED as a regression problem, we design machine-learning systems that learn the relation between onset, duration, and acoustic features. To leverage unlabeled data in model training, we propose a simple and effective training process based on a mean-teacher framework. We implemented two regression-based SED systems with CRNN and VGGSK models. On the validation set of the Domestic Environment Sound Event Detection Dataset, the two regression-based systems achieve event-based F1 scores of 0.480 and 0.523 and intersection-based F1 scores of 0.673 and 0.728, respectively. Both systems outperform the classification-based systems built on the same backbone networks, which achieve event-based F1 scores of 0.429 and 0.463 and intersection-based F1 scores of 0.643 and 0.704.
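As a rough illustration of the mean-teacher idea mentioned above, the following minimal sketch shows the two ingredients such a framework commonly uses: an exponential moving average (EMA) of the student weights that forms the teacher, and a consistency loss between student and teacher predictions on unlabeled data. The abstract does not give these details, so the function names `ema_update` and `consistency_loss`, the decay value `alpha`, and the use of mean squared error over (onset, duration) outputs are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List


def ema_update(teacher: List[float], student: List[float],
               alpha: float = 0.99) -> List[float]:
    """Move the teacher weights toward the student weights by an EMA step.

    `alpha` is a hypothetical decay value; the source does not state one.
    """
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def consistency_loss(student_pred: List[float],
                     teacher_pred: List[float]) -> float:
    """Mean squared error between student and teacher regression outputs.

    The outputs stand in for values such as (onset, duration); using MSE
    here is an assumption for this sketch.
    """
    squared = [(s - t) ** 2 for s, t in zip(student_pred, teacher_pred)]
    return sum(squared) / len(squared)


# Toy usage: one EMA step and one consistency term on an unlabeled clip.
teacher_weights = ema_update([1.0, 0.0], [0.0, 1.0], alpha=0.9)
loss = consistency_loss([1.0, 2.0], [1.0, 4.0])
```

In a setup of this kind, only the student is updated by gradient descent, while the teacher follows it through the EMA step after each training iteration.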
SHEN Yaxin, GAO Lijian, MAO Qirong
Rui Tao, Long Yan, Kazushige Ouchi, Xiangdong Wang
Xu Zheng, Yan Song, Jie Yan, Li-Rong Dai, Ian McLoughlin, Lin Liu
Ziqiang Shi, Liu Liu, Huibin Lin, Rujie Liu, Anyan Shi
Liwei Lin, Xiangdong Wang, Hong Liu, Yueliang Qian