Fabio Vesperini, Leonardo Gabrielli, Emanuele Principi, Stefano Squartini
Artificial sound event detection (SED) aims to mimic the human ability to perceive and understand what is happening in the surrounding environment. Deep learning currently offers valuable techniques for this goal, such as Convolutional Neural Networks (CNNs). The Capsule Neural Network (CapsNet) architecture was recently introduced in the image processing field to overcome some known limitations of CNNs, specifically their limited robustness to affine transformations (i.e., perspective, size, orientation) and their difficulty in detecting overlapping images. This motivated the authors to employ CapsNets for the polyphonic SED task, in which multiple sound events occur simultaneously. Specifically, we propose to exploit capsule units to represent a set of distinctive properties of each individual sound event. Capsule units are connected through a so-called "dynamic routing" procedure that encourages the learning of part-whole relationships and improves detection performance in a polyphonic context. This paper reports extensive evaluations carried out on three publicly available datasets, showing that the CapsNet-based algorithm not only outperforms standard CNNs but also achieves the best results with respect to state-of-the-art algorithms.
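The "dynamic routing" mentioned in the abstract refers to the routing-by-agreement procedure between capsule layers introduced with CapsNets. The following is a minimal NumPy sketch of that iterative procedure, not the paper's actual implementation; the shapes, the number of routing iterations, and the function names are illustrative assumptions.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Squashing non-linearity: keeps the vector direction while
    # mapping its norm into [0, 1), so the length can act as a
    # probability that the capsule's entity is present.
    norm2 = np.sum(s ** 2, axis=axis, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def dynamic_routing(u_hat, n_iters=3):
    """Routing-by-agreement between two capsule layers (sketch).

    u_hat: array of shape (n_in, n_out, dim) holding the prediction
    vector that each input capsule i makes for each output capsule j.
    Returns the output capsule vectors, shape (n_out, dim).
    """
    n_in, n_out, dim = u_hat.shape
    b = np.zeros((n_in, n_out))  # routing logits, start uniform
    for _ in range(n_iters):
        # Coupling coefficients: softmax of logits over output capsules
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        s = np.einsum('ij,ijd->jd', c, u_hat)      # weighted sum of predictions
        v = squash(s)                               # candidate output capsules
        # Increase the logit where prediction and output agree (dot product),
        # which is what encourages part-whole assignments.
        b = b + np.einsum('ijd,jd->ij', u_hat, v)
    return v

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(8, 4, 16))  # 8 input capsules, 4 outputs, 16-D poses
v = dynamic_routing(u_hat)
print(v.shape)  # (4, 16)
```

In a polyphonic SED setting, each output capsule would correspond to one sound event class, with the capsule vector's length interpreted as the event's activity.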