JOURNAL ARTICLE

Zero-shot Video Emotion Recognition via Multimodal Protagonist-aware Transformer Network

Abstract

Recognizing human emotions from videos has attracted significant attention in numerous computer vision and multimedia applications, such as human-computer interaction and health care. The goal is to understand humans' emotional responses, where candidate emotion categories are generally defined by specific psychological theories. However, as psychological theories develop, emotion categories become increasingly diverse and fine-grained, and samples become increasingly difficult to collect. In this paper, we investigate a new task of zero-shot video emotion recognition, which aims to recognize rare unseen emotions. Specifically, we propose a novel multimodal protagonist-aware transformer network composed of two branches: one is equipped with a novel dynamic emotional attention mechanism and a visual transformer to learn better visual representations; the other is an acoustic transformer for learning discriminative acoustic representations. We align the visual and acoustic representations with semantic embeddings of fine-grained emotion labels by jointly mapping them into a common space under a noise contrastive estimation objective. Extensive experiments on three datasets demonstrate the effectiveness of the proposed method.
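The abstract describes aligning audio-visual representations with emotion-label embeddings in a common space under a noise contrastive estimation objective, and zero-shot recognition then reduces to nearest-label retrieval in that space. The sketch below illustrates this general idea with a standard InfoNCE-style loss in NumPy; the function names, the temperature value, and the use of cosine similarity are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Project embeddings onto the unit sphere so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce_loss(clip_emb, label_emb, temperature=0.1):
    """InfoNCE-style contrastive loss (illustrative, not the paper's exact objective).

    Row i of clip_emb (a fused audio-visual clip embedding) is treated as the
    positive pair of row i of label_emb (the semantic embedding of its emotion
    label); all other label rows in the batch serve as negatives.
    """
    v = l2_normalize(clip_emb)
    t = l2_normalize(label_emb)
    logits = v @ t.T / temperature                    # similarity logits
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))               # pull positives together

def zero_shot_predict(clip_emb, unseen_label_emb):
    # Zero-shot inference: assign each clip to the nearest unseen-emotion
    # label embedding in the shared space.
    sims = l2_normalize(clip_emb) @ l2_normalize(unseen_label_emb).T
    return sims.argmax(axis=1)
```

Because the label side of the space is populated from semantic embeddings rather than trained classifier weights, embeddings of emotions never seen during training can be dropped in at test time, which is what makes the zero-shot setting possible.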

Keywords:
Discriminative model, Computer science, Transformer, Emotion recognition, Semantic space, Speech recognition, Artificial intelligence, Human–computer interaction

Metrics

Cited by: 10
FWCI (Field-Weighted Citation Impact): 1.83
References: 61
Citation Normalized Percentile: 0.83

Topics

Emotion and Mood Recognition
Social Sciences →  Psychology →  Experimental and Cognitive Psychology
Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Video Surveillance and Tracking Methods
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition