Guoming Chen, Zhuoxian Qian, Dong Zhang, Shuang Qiu, Ruqi Zhou
Deep neural networks have demonstrated significant potential in applications such as human-computer interaction and emotion analysis, particularly in multimodal emotion recognition. However, they remain vulnerable to adversarial examples, in which subtle perturbations can severely degrade classifier performance. Inspired by the sparse, asynchronous spiking activity and inherent nonlinearity of spiking neural networks (SNNs), we propose a novel framework, the Sliding Parallel Spiking Convolutional Vision Transformer (SPSNCVT), designed for robust and efficient multimodal emotion recognition. Our framework processes multiple signals, including facial expressions, voice, and text, using aligned heatmap features and multiscale wavelet transforms for precise feature extraction. Experimental results indicate that the SPSNCVT framework significantly improves classification accuracy under adversarial attacks such as the fast gradient sign method (FGSM), basic iterative method (BIM), and projected gradient descent (PGD), achieving performance gains of 3.60%-4.01% and 7.03%-13.73% over baseline models. Furthermore, SPSNCVT delivers strong energy efficiency and computational speed, highlighting its potential for deployment in real-time application scenarios.
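The attacks named in the abstract all perturb inputs along the gradient of the loss. As a point of reference (not the paper's own code), the simplest of them, FGSM, can be sketched in a few lines of NumPy on a toy logistic-regression model; the weights, input, and epsilon below are illustrative values, not from the paper.

```python
import numpy as np

def fgsm_perturb(x, grad, eps):
    """Fast Gradient Sign Method: shift every input feature by eps
    in the direction that increases the loss (sign of the gradient)."""
    return x + eps * np.sign(grad)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy logistic-regression example with cross-entropy loss.
# For label y = 1, loss = -log(sigmoid(w @ x)), so dL/dx = (sigmoid(w @ x) - 1) * w.
w = np.array([2.0, -1.0])   # illustrative model weights
x = np.array([0.5, 0.5])    # illustrative clean input
y = 1.0

p = sigmoid(w @ x)          # confidence in the true class before the attack
grad_x = (p - y) * w        # gradient of the loss w.r.t. the input
x_adv = fgsm_perturb(x, grad_x, eps=0.1)

# The adversarial input lowers the model's confidence in the true class.
p_adv = sigmoid(w @ x_adv)
print(p, p_adv)             # p_adv < p
```

BIM and PGD extend this idea by applying small FGSM steps iteratively, with PGD also projecting each iterate back into an epsilon-ball around the clean input; robustness frameworks such as SPSNCVT are evaluated against all three.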
Mikhail Leontev, Dmitry Antonov, Sergey Sukhov
Zhiquan Qin, Guoxing Liu, Xianming Lin
Elif Kanca, Selen Ayas, Elif Baykal Kablan, Murat Ekinci
Syed Aun Muhammad Zaidi, Siddique Latif, Junaid Qadir
Weiran Guo, Guanjun Liu, Ziyuan Zhou, Ling Wang, Jiacun Wang