Sound Event Detection of Weakly Labelled Data With CNN-Transformer and Automatic Threshold Optimization

Qiuqiang Kong; Yong Xu; Wenwu Wang; Mark D. Plumbley

doi:10.1109/taslp.2020.3014737


JOURNAL ARTICLE


Year: 2020
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Vol: 28
Pages: 2450-2460
Publisher: Institute of Electrical and Electronics Engineers



Abstract

Sound event detection (SED) is the task of detecting sound events in an audio recording. One challenge of the SED task is that many datasets, such as the Detection and Classification of Acoustic Scenes and Events (DCASE) datasets, are weakly labelled. That is, there are only audio tags for each audio clip, without the onset and offset times of sound events. We compare segment-wise and clip-wise training for SED, a comparison that is lacking in previous works. We propose a convolutional neural network transformer (CNN-Transformer) for audio tagging and SED, and show that the CNN-Transformer performs similarly to a convolutional recurrent neural network (CRNN). Another challenge of SED is that thresholds are required for detecting sound events. Previous works set thresholds empirically, which is not an optimal approach. To solve this problem, we propose an automatic threshold optimization method. The first stage is to optimize the system with respect to metrics that do not depend on thresholds, such as mean average precision (mAP). The second stage is to optimize the thresholds with respect to metrics that depend on those thresholds. Our proposed automatic threshold optimization system achieves a state-of-the-art audio tagging F1 of 0.646, outperforming the 0.629 obtained without threshold optimization, and a sound event detection F1 of 0.584, outperforming the 0.564 obtained without threshold optimization.
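The second stage described in the abstract, tuning thresholds against a threshold-dependent metric, can be illustrated with a minimal sketch. This is not the authors' optimizer: the abstract does not say how the thresholds are searched, so a plain per-class grid search maximizing F1 is swapped in here as an assumption. The function names (`f1_score`, `optimize_thresholds`), the candidate grid, and the toy data are all hypothetical.

```python
from typing import List, Optional, Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 = 2*TP / (2*TP + FP + FN); returns 0.0 if the denominator is zero."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def optimize_thresholds(
    targets: List[List[int]],
    probs: List[List[float]],
    candidates: Optional[List[float]] = None,
) -> List[float]:
    """Pick one threshold per class by grid search on F1 (assumed search strategy).

    targets[n][k] is the 0/1 tag of class k in clip n;
    probs[n][k] is the model's predicted probability for that tag.
    """
    if candidates is None:
        # Hypothetical grid: 0.05, 0.10, ..., 0.95
        candidates = [i / 20 for i in range(1, 20)]
    n_classes = len(targets[0])
    best_thresholds = []
    for k in range(n_classes):
        y_true = [row[k] for row in targets]
        scores = [row[k] for row in probs]
        best_t, best_f1 = 0.5, -1.0
        for t in candidates:
            f1 = f1_score(y_true, [int(s >= t) for s in scores])
            if f1 > best_f1:  # keep the lowest threshold among ties
                best_t, best_f1 = t, f1
        best_thresholds.append(best_t)
    return best_thresholds


# Toy validation data (made up for illustration): 4 clips, 2 classes.
toy_targets = [[1, 0], [1, 1], [0, 0], [0, 1]]
toy_probs = [[0.9, 0.22], [0.42, 0.8], [0.33, 0.12], [0.12, 0.27]]
toy_thresholds = optimize_thresholds(toy_targets, toy_probs)
print(toy_thresholds)
```

In practice the thresholds would be tuned on held-out validation data after the model itself has been selected with a threshold-free metric such as mAP, and then applied unchanged at test time.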

Keywords:

Computer science; Convolutional neural network; Transformer; Pattern recognition (psychology); Speech recognition; Artificial intelligence; Offset (computer science)

Metrics

Cited By: 121

FWCI (Field Weighted Citation Impact): 12.56

Refs

Citation Normalized Percentile: 0.99

Is in top 1%

Is in top 10%


Topics

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Music Technology and Sound Studies

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition


Related Documents

Sound event detection with weakly labelled data

Sound Event Detection and Time–Frequency Segmentation from Weakly Labelled Data

Sound Event Detection: A Wavelet Based Approach For Weakly Labelled Data

Weakly labeled sound event detection with a capsule-transformer model

CNN-Transformer with Self-Attention Network for Sound Event Detection