Xu Wang, Xiangjinzi Zhang, Yunfei Zi, Shengwu Xiong
Sound event detection (SED) consists of two subtasks: predicting the classes of sound events within an audio clip (audio tagging) and indicating the onset and offset times of each event (localization). A common approach to SED with weak labels is the multiple instance learning (MIL) method. However, the general MIL method optimizes only the global loss computed from the aggregated clip-wise predictions and the weak clip labels, with no direct constraint on the frame-wise predictions, which leads to a large number of unreasonable prediction values. To address this issue, we explore the deterministic information that can be used to constrain the frame-wise predictions and, based on it, design a frame loss with two terms. Experimental results on the DCASE 2017 Task 4 dataset demonstrate that the proposed loss improves the performance of the general MIL method. While this article focuses on SED applications, the proposed method could be applied broadly to MIL problems. Code will be available at WSSED.
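The global MIL loss the abstract refers to can be sketched as follows: frame-wise probabilities are pooled into a single clip-wise prediction, which is compared against the weak clip label. This is a minimal illustration, not the paper's implementation; the linear softmax pooling used here is one common aggregation choice in weakly supervised SED, and the abstract does not specify the paper's pooling function or the two frame-loss terms.

```python
import numpy as np

def mil_clip_loss(frame_probs, clip_labels, eps=1e-7):
    """Global MIL loss (hypothetical sketch): aggregate frame-wise event
    probabilities into clip-wise predictions, then take binary cross-entropy
    against the weak clip labels.

    frame_probs: (T, C) per-frame event probabilities in (0, 1)
    clip_labels: (C,) weak 0/1 labels for the whole clip
    """
    # Linear softmax pooling: each frame is weighted by its own probability,
    # so confident frames dominate the clip-level prediction.
    clip_probs = (frame_probs ** 2).sum(axis=0) / (frame_probs.sum(axis=0) + eps)
    clip_probs = np.clip(clip_probs, eps, 1 - eps)
    # Binary cross-entropy between clip-wise predictions and weak labels.
    bce = -(clip_labels * np.log(clip_probs)
            + (1 - clip_labels) * np.log(1 - clip_probs))
    return bce.mean()
```

Note that this loss touches the frame-wise predictions only through the pooled clip probability, which is exactly the gap the abstract's frame loss is designed to fill: many different frame-wise profiles can yield the same clip-wise prediction and thus the same global loss.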
Yu Tian, Guansong Pang, Fengbei Liu, Yuyuan Liu, Chong Wang, Yuanhong Chen, Johan Verjans, Gustavo Carneiro
Wei Gao, Fang Wan, Jun Yue, Songcen Xu, Qixiang Ye
Hui Lv, Zhongqi Yue, Qianru Sun, Bin Luo, Zhen Cui, Hanwang Zhang
Liwei Lin, Xiangdong Wang, Hong Liu, Yueliang Qian