Zhor Diffallah, Hadjer Ykhlef, Hafida Bouarfa
Sound event detection refers to the task of categorizing the types of events occurring in an audio recording, in addition to pinpointing the start and end times of each occurrence. This task has recently grown in popularity as a result of its potential to enhance a myriad of applications. Building sound event detection systems relies heavily on the representational power of deep neural network architectures. Such architectures require a large amount of strongly annotated audio data, in which the exact temporal location of each sound event is indicated. However, manually annotating audio recordings with the types of events present and their corresponding time boundaries is both costly and laborious. To mitigate this, learning from weak labels has been adopted in an attempt to bypass the labeling barrier. In this paper, we examine the effect of incorporating weakly-labeled data into the training process of sound event detection systems. Moreover, we analyze the behavior of the Mean Teacher framework under various deep learning configurations. Our experimental results reveal that training a well-calibrated Mean Teacher architecture on weakly-labeled data can improve the predictive performance of sound event detection systems.
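The Mean Teacher framework mentioned above maintains a teacher model whose weights are an exponential moving average (EMA) of the student's weights, and adds a consistency term that pulls the student's predictions toward the teacher's. The following is a minimal NumPy sketch of those two ingredients only; the function names, the dictionary-of-arrays weight representation, and the default smoothing factor are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ema_update(teacher_weights, student_weights, alpha=0.999):
    """Update the teacher as an exponential moving average of the student.

    Both arguments are dicts mapping parameter names to NumPy arrays;
    alpha is the EMA smoothing factor (an assumed default).
    """
    return {name: alpha * teacher_weights[name] + (1 - alpha) * student_weights[name]
            for name in teacher_weights}

def consistency_loss(student_pred, teacher_pred):
    """Mean squared error between student and teacher class probabilities."""
    return float(np.mean((student_pred - teacher_pred) ** 2))
```

In a training loop, the student is updated by gradient descent on the supervised loss (computed on the weakly-labeled clips) plus a weighted `consistency_loss`, after which `ema_update` refreshes the teacher; only the student receives gradients.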