Sound Event Detection (SED) is a core task in machine listening that aims to mimic the human auditory system's ability to recognize sounds. Recently, convolutional recurrent neural networks (CRNNs) have attained state-of-the-art SED performance. In a CRNN, the convolution module extracts local time-frequency information from the audio; however, the limited size of the convolution kernel prevents it from capturing global context. To address this shortcoming, the convolution module is replaced with a conformer block, which combines the advantages of transformers and convolutional neural networks to model both the local and global dependencies of audio sequences. Compared with CNN, RNN, and CRNN models on the TUT-SED 2017 dataset, the proposed method improves the F1-score by 9.86% and reduces the error rate (ER) by 0.1235 on the development set, and improves the F1-score by 9.13% and reduces the ER by 0.0836 on the evaluation set. Experimental results demonstrate the effectiveness and superiority of the proposed approach.
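The contrast between the two operations can be made concrete. The sketch below is not the paper's implementation; it is a minimal numpy illustration, under the assumption of a toy 100-frame feature sequence, of why a convolution sees only a kernel-sized neighbourhood of frames while self-attention (the global half of a conformer block) lets every frame attend to every other frame.

```python
import numpy as np

def conv1d(x, kernel):
    """Local context: each output frame depends only on kernel-sized neighbours."""
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, (pad, pad))
    return np.array([np.dot(xp[i:i + k], kernel) for i in range(len(x))])

def self_attention(X):
    """Global context: every frame attends to every other frame in the sequence."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)            # (frames, frames) similarity matrix
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability for softmax
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)        # rows are attention distributions
    return w @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))            # toy "spectrogram": 100 frames, 8 features

y_conv = conv1d(X[:, 0], np.ones(3) / 3)     # frame t sees only frames t-1..t+1
y_attn = self_attention(X)                   # frame t sees all 100 frames
```

A conformer block interleaves both: a self-attention sub-layer for long-range structure and a depthwise-convolution sub-layer for fine local time-frequency patterns, which is the combination the abstract credits for the improvement over a plain CRNN.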