Emotion recognition is one of the main tasks in the human-computer interaction domain, and, given the recent advances in Deep Learning, the topic has attracted great interest from the research community. The task lies at the border between Affective Computing and Social Signal Processing, with applications ranging from healthcare to robotics, marketing, and security. In this paper, we propose a novel approach for multimodal emotion recognition that introduces an attention mechanism guiding the extraction of visual features by means of information extracted from the audio signal. Audio and visual feature extraction is performed by convolutional neural networks applied to the Mel spectrogram of the audio signal and to equally spaced video frames, respectively. The experimental results show that the insertion of the attention mechanism improves the overall accuracy of the emotion recognition system. The method is validated on a publicly available dataset, CREMA-D.
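As a rough illustration of the idea, the sketch below shows how pooled audio features could guide spatial attention over visual feature maps; it is a minimal PyTorch example, and all layer names, dimensions, and the dot-product attention formulation are our assumptions rather than the paper's exact architecture.

```python
# Minimal sketch of audio-guided attention (illustrative; not the paper's
# exact design): a pooled audio embedding from a Mel-spectrogram CNN acts as
# a query that weights spatial locations of a visual CNN feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioGuidedAttention(nn.Module):
    def __init__(self, audio_dim=128, visual_channels=256):
        super().__init__()
        # Project the audio embedding into the visual channel space.
        self.audio_proj = nn.Linear(audio_dim, visual_channels)

    def forward(self, audio_feat, visual_maps):
        # audio_feat:  (B, audio_dim)  pooled CNN features of the Mel spectrogram
        # visual_maps: (B, C, H, W)    CNN feature maps of a video frame
        B, C, H, W = visual_maps.shape
        query = self.audio_proj(audio_feat)                        # (B, C)
        # Scaled dot-product similarity between the audio query
        # and the visual features at each spatial location.
        scores = torch.einsum('bc,bchw->bhw', query, visual_maps) / C ** 0.5
        attn = F.softmax(scores.view(B, -1), dim=1).view(B, 1, H, W)
        # Attention-weighted pooling of the visual features.
        attended = (attn * visual_maps).sum(dim=(2, 3))            # (B, C)
        return attended, attn

# Usage: the attended visual vector can then be fused with the audio
# embedding and fed to an emotion classifier.
model = AudioGuidedAttention()
audio = torch.randn(4, 128)          # batch of pooled audio embeddings
frames = torch.randn(4, 256, 7, 7)   # batch of visual feature maps
visual_vec, attn_map = model(audio, frames)
print(visual_vec.shape, attn_map.shape)  # (4, 256) and (4, 1, 7, 7)
```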