JOURNAL ARTICLE

Multimodal Emotion Prediction in Interpersonal Videos Integrating Facial and Speech Cues

Abstract

Emotion prediction is essential for affective computing applications, including human-computer interaction and social behavior analysis. In interpersonal settings, accurately predicting emotional states is crucial for modeling social dynamics. We propose a multimodal framework that integrates facial expressions and speech cues to enhance emotion prediction in interpersonal video interactions. Facial features are extracted with a deep attention-based network, while speech is encoded using Wav2Vec 2.0. The resulting multimodal features are modeled temporally with an LSTM network. To adapt the IMEmo dataset for multimodal learning, we introduce a novel speech-feature alignment strategy that synchronizes facial and vocal expressions. Our approach examines the impact of multimodal fusion on emotion prediction and demonstrates its effectiveness in capturing complex emotional dynamics. Experiments show that our framework improves sentiment classification accuracy by over 17% compared with facial-only baselines. While fine-grained emotion recognition remains challenging, our results highlight the enhanced robustness and generalizability of our method in real-world interpersonal scenarios.
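The abstract describes the pipeline only at a high level: per-frame facial embeddings and Wav2Vec 2.0 speech features are temporally aligned, fused, and passed through an LSTM for sentiment classification. The sketch below illustrates one way such a pipeline can be wired up; it is not the authors' implementation. It assumes PyTorch and HuggingFace transformers, the facebook/wav2vec2-base-960h checkpoint, a hypothetical external face encoder that produces 512-dimensional per-frame embeddings, simple linear interpolation as a stand-in for the paper's speech-feature alignment strategy, and a three-class sentiment head.

```python
# Minimal sketch of a facial + speech fusion pipeline (not the authors' code).
# Assumptions (hypothetical, not taken from the paper): PyTorch + HuggingFace
# transformers, the facebook/wav2vec2-base-960h checkpoint, 512-d per-frame face
# embeddings from an external face encoder, linear interpolation as the
# alignment step, and a 3-class sentiment output.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor


class FusionEmotionModel(nn.Module):
    def __init__(self, face_dim=512, speech_dim=768, hidden_dim=256, num_classes=3):
        super().__init__()
        self.wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.wav2vec.eval()  # used here as a frozen speech feature extractor
        for p in self.wav2vec.parameters():
            p.requires_grad = False
        # LSTM over concatenated per-frame face + speech features.
        self.lstm = nn.LSTM(face_dim + speech_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def align_speech_to_video(self, speech_feats, num_video_frames):
        # Stand-in alignment: resample the ~50 Hz Wav2Vec 2.0 frame sequence to
        # the video frame count with linear interpolation. The paper's own
        # speech-feature alignment strategy is not reproduced here.
        x = speech_feats.transpose(1, 2)                       # (B, C, T_speech)
        x = F.interpolate(x, size=num_video_frames, mode="linear", align_corners=False)
        return x.transpose(1, 2)                               # (B, T_video, C)

    def forward(self, face_feats, waveform_inputs):
        # face_feats: (B, T_video, face_dim) from a pretrained face encoder (assumed).
        # waveform_inputs: (B, num_samples) 16 kHz mono audio.
        with torch.no_grad():
            speech = self.wav2vec(waveform_inputs).last_hidden_state  # (B, T_speech, 768)
        speech = self.align_speech_to_video(speech, face_feats.size(1))
        fused = torch.cat([face_feats, speech], dim=-1)        # (B, T_video, D_face + D_speech)
        _, (h_n, _) = self.lstm(fused)
        return self.classifier(h_n[-1])                        # sentiment logits


if __name__ == "__main__":
    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
    model = FusionEmotionModel()
    # Toy inputs: 2 clips, 30 video frames each, 2 seconds of 16 kHz audio.
    face_feats = torch.randn(2, 30, 512)
    audio = [torch.randn(32000).numpy() for _ in range(2)]
    inputs = extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
    logits = model(face_feats, inputs.input_values)
    print(logits.shape)  # torch.Size([2, 3])
```

Freezing the Wav2Vec 2.0 encoder and training only the LSTM and classifier keeps the sketch lightweight; the paper's actual face network, alignment procedure, and training setup may differ.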



Topics

Emotion and Mood Recognition
Social Sciences → Psychology → Experimental and Cognitive Psychology