JOURNAL ARTICLE

Multimodal Emotion Prediction in Interpersonal Videos Integrating Facial and Speech Cues

Abstract

Emotion prediction is essential for affective computing applications, including human-computer interaction and social behavior analysis. In interpersonal settings, accurately predicting emotional states is crucial for modeling social dynamics. We propose a multimodal framework that integrates facial expressions and speech cues to enhance emotion prediction in interpersonal video interactions. Facial features are extracted with a deep attention-based network, while speech is encoded using Wav2Vec 2.0. The resulting multimodal features are modeled temporally with an LSTM network. To adapt the IMEmo dataset for multimodal learning, we introduce a novel speech-feature alignment strategy that synchronizes facial and vocal expressions. Our approach examines the impact of multimodal fusion on emotion prediction and demonstrates its effectiveness in capturing complex emotional dynamics. Experiments show that our framework improves sentiment classification accuracy by over 17% compared with facial-only baselines. While fine-grained emotion recognition remains challenging, our results highlight the enhanced robustness and generalizability of our method in real-world interpersonal scenarios.
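The abstract describes the pipeline only at a high level: per-frame facial embeddings and Wav2Vec 2.0 speech features are temporally aligned, fused, and passed through an LSTM for sentiment classification. The sketch below illustrates one way such a pipeline can be wired up; it is not the authors' implementation. It assumes PyTorch and HuggingFace transformers, the facebook/wav2vec2-base-960h checkpoint, a hypothetical external face encoder that produces 512-dimensional per-frame embeddings, simple linear interpolation as a stand-in for the paper's speech-feature alignment strategy, and a three-class sentiment head.

```python
# Minimal sketch of a facial + speech fusion pipeline (not the authors' code).
# Assumptions (hypothetical, not taken from the paper): PyTorch + HuggingFace
# transformers, the facebook/wav2vec2-base-960h checkpoint, 512-d per-frame face
# embeddings from an external face encoder, linear interpolation as the
# alignment step, and a 3-class sentiment output.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor


class FusionEmotionModel(nn.Module):
    def __init__(self, face_dim=512, speech_dim=768, hidden_dim=256, num_classes=3):
        super().__init__()
        self.wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.wav2vec.eval()  # used here as a frozen speech feature extractor
        for p in self.wav2vec.parameters():
            p.requires_grad = False
        # LSTM over concatenated per-frame face + speech features.
        self.lstm = nn.LSTM(face_dim + speech_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def align_speech_to_video(self, speech_feats, num_video_frames):
        # Stand-in alignment: resample the ~50 Hz Wav2Vec 2.0 frame sequence to
        # the video frame count with linear interpolation. The paper's own
        # speech-feature alignment strategy is not reproduced here.
        x = speech_feats.transpose(1, 2)                       # (B, C, T_speech)
        x = F.interpolate(x, size=num_video_frames, mode="linear", align_corners=False)
        return x.transpose(1, 2)                               # (B, T_video, C)

    def forward(self, face_feats, waveform_inputs):
        # face_feats: (B, T_video, face_dim) from a pretrained face encoder (assumed).
        # waveform_inputs: (B, num_samples) 16 kHz mono audio.
        with torch.no_grad():
            speech = self.wav2vec(waveform_inputs).last_hidden_state  # (B, T_speech, 768)
        speech = self.align_speech_to_video(speech, face_feats.size(1))
        fused = torch.cat([face_feats, speech], dim=-1)        # (B, T_video, D_face + D_speech)
        _, (h_n, _) = self.lstm(fused)
        return self.classifier(h_n[-1])                        # sentiment logits


if __name__ == "__main__":
    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
    model = FusionEmotionModel()
    # Toy inputs: 2 clips, 30 video frames each, 2 seconds of 16 kHz audio.
    face_feats = torch.randn(2, 30, 512)
    audio = [torch.randn(32000).numpy() for _ in range(2)]
    inputs = extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
    logits = model(face_feats, inputs.input_values)
    print(logits.shape)  # torch.Size([2, 3])
```

Freezing the Wav2Vec 2.0 encoder and training only the LSTM and classifier keeps the sketch lightweight; the paper's actual face network, alignment procedure, and training setup may differ.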



Topics

Emotion and Mood Recognition
Social Sciences → Psychology → Experimental and Cognitive Psychology