Justin Salamon, Juan Pablo Bello
The ability of deep convolutional neural networks (CNN) to learn discriminative spectro-temporal patterns makes them well suited to environmental sound classification. However, the relative scarcity of labeled data has impeded the exploitation of this family of high-capacity models. This study has two primary contributions: first, we propose a deep convolutional neural network architecture for environmental sound classification. Second, we propose the use of audio data augmentation for overcoming the problem of data scarcity and explore the influence of different augmentations on the performance of the proposed CNN architecture. Combined with data augmentation, the proposed model produces state-of-the-art results for environmental sound classification. We show that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a "shallow" dictionary learning model with augmentation. Finally, we examine the influence of each augmentation on the model's classification accuracy for each class, and observe that the accuracy for each class is influenced differently by each augmentation, suggesting that the performance of the model could be improved further by applying class-conditional data augmentation.