JOURNAL ARTICLE

Multimodal Emotion Recognition with Attention

Abstract

Emotion recognition is one of the main tasks in the human-computer interaction domain, and given recent advances in deep learning, the topic has attracted great interest from the research community. The task lies at the border between affective computing and social signal processing, with applications ranging from healthcare to robotics, marketing, and security. In this paper, we propose a novel approach to multimodal emotion recognition that introduces an attention mechanism to guide the extraction of visual features using information extracted from the audio signal. Audio and visual features are extracted by convolutional neural networks applied to the Mel spectrogram of the audio signal and to equally spaced video frames, respectively. Experimental results show that the attention mechanism improves the overall accuracy of the emotion recognition system. The method is validated on a publicly available dataset, CREMA-D.
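The abstract does not give the exact formulation of the audio-guided attention, but a common way to let an audio embedding guide visual feature extraction is to score each spatial location of the CNN feature map against the audio vector and pool with a softmax over those scores. A minimal NumPy sketch under that assumption (all shapes and names here are illustrative, not the paper's):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def audio_guided_attention(visual_map, audio_vec):
    """Pool a CNN visual feature map with attention weights
    derived from an audio embedding.

    visual_map: (H*W, d) spatial features from the visual CNN.
    audio_vec:  (d,) embedding from the Mel-spectrogram branch.
    Returns the attended (d,) visual descriptor and the weights.
    """
    d = audio_vec.shape[0]
    scores = visual_map @ audio_vec / np.sqrt(d)  # (H*W,) similarity scores
    weights = softmax(scores)                     # non-negative, sum to 1
    attended = weights @ visual_map               # (d,) weighted spatial pool
    return attended, weights

# Toy example: a 7x7 spatial grid of 128-dim features.
rng = np.random.default_rng(0)
V = rng.standard_normal((49, 128))
a = rng.standard_normal(128)
feat, w = audio_guided_attention(V, a)
```

In this sketch the audio signal decides which spatial regions of the frame contribute to the pooled visual descriptor; the descriptor would then be fused with the audio features for emotion classification.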

Keywords:
Computer science; Spectrogram; Convolutional neural network; Emotion recognition; Feature extraction; Speech recognition; Artificial intelligence; Deep learning; Human–computer interaction; Audio signal; Speech coding

Metrics

Cited By: 5
FWCI (Field Weighted Citation Impact): 2.08
Refs: 16
Citation Normalized Percentile: 0.81

Topics

Emotion and Mood Recognition (Social Sciences → Psychology → Experimental and Cognitive Psychology)
Speech and Audio Processing (Physical Sciences → Computer Science → Signal Processing)
Music and Audio Processing (Physical Sciences → Computer Science → Signal Processing)