Ying Hu, Xiujuan Zhu, Yunlong Li, Hao Huang, Liang He
Sound event detection (SED) is an interesting but challenging task due to the scarcity of data and the diversity of sound events in real life. This paper presents a multi-grained based attention network (MGA-Net) for semi-supervised sound event detection. To obtain feature representations related to sound events, a residual hybrid convolution (RH-Conv) block is designed to boost the vanilla convolution's ability to extract time-frequency features. Moreover, a multi-grained attention (MGA) module is designed to learn temporal-resolution features from coarse level to fine level. With the MGA module, the network can capture the characteristics of target events of short or long duration, and thus determine the onset and offset of sound events more accurately. Furthermore, to effectively boost the performance of the Mean Teacher (MT) method, a spatial shift (SS) module is introduced as a data perturbation mechanism to increase the diversity of the data. Experimental results show that MGA-Net outperforms the published state-of-the-art competitors, achieving event-based macro F1 (EB-F1) scores of 53.27% and 56.96%, and polyphonic sound detection scores (PSDS) of 0.709 and 0.739, on the validation and public sets respectively.
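As a brief aside, the Mean Teacher (MT) method mentioned in the abstract maintains a teacher model whose weights are an exponential moving average (EMA) of the student's weights; the student is trained to be consistent with the teacher's predictions on perturbed inputs. A minimal sketch of the EMA weight update follows; the function name, the flat-list weight representation, and the smoothing factor `alpha` are illustrative assumptions, not details from this paper.

```python
def ema_update(teacher_weights, student_weights, alpha=0.999):
    """Move each teacher weight toward the corresponding student weight.

    teacher <- alpha * teacher + (1 - alpha) * student
    Weights are represented here as flat lists of floats for simplicity;
    a real implementation would iterate over model parameter tensors.
    """
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_weights, student_weights)]

# One update step with a small alpha to make the movement visible:
teacher = ema_update([0.0, 1.0], [1.0, 1.0], alpha=0.9)
```

Because `alpha` is close to 1 in practice, the teacher changes slowly, which is what makes its predictions a stable consistency target for the student.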
Maolin Tang, Qijun Zhao, Zhengxi Liu
Yadong Guan, Jiabin Xue, Guibin Zheng, Jiqing Han
Chia-Chuan Liu, Chia-Ping Chen, Chung-Li Lu, Bo-Cheng Chan, Yu-Han Cheng, Hsiang-Feng Chuang, Wei-Yu Chen
SHEN Yaxin, GAO Lijian, MAO Qirong