Fan Qi, Huaiwen Zhang, Xiaoshan Yang, Changsheng Xu
Multi-modal Emotion Recognition (MER) aims to identify various human emotions from heterogeneous modalities. As emotion theories evolve, increasingly novel and fine-grained concepts are introduced to describe human emotional states, so real-world recognition systems often encounter emotion labels unseen during training. To address this challenge, we propose a versatile zero-shot MER framework that refines emotion label embeddings to capture inter-label relationships and improve discrimination between labels. Specifically, we integrate prior knowledge into a novel affective graph space that generates tailored label embeddings. To obtain multimodal representations, we disentangle the features of each modality into egocentric and altruistic components using adversarial learning, and then hierarchically fuse these components with a hybrid co-attention mechanism. Furthermore, an emotion-guided decoder exploits label-modal dependencies to generate adaptive multimodal representations conditioned on the emotion embeddings. We conduct extensive experiments with different multimodal combinations, including visual-acoustic and visual-textual inputs, on four datasets under both single-label and multi-label zero-shot settings. The results demonstrate the superiority of our framework over state-of-the-art methods.
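To make the adversarial disentanglement step concrete, the following is a minimal sketch, assuming PyTorch, of splitting one modality's features into an egocentric (modality-specific) and an altruistic (modality-shared) component, with a gradient-reversal discriminator supplying the adversarial signal. All names here (Disentangler, GradReverse, grad_reverse, dim, num_modalities) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of adversarial modality disentanglement (not the
# paper's code): two encoders split a feature into egocentric and
# altruistic parts; a discriminator behind a gradient-reversal layer
# pushes the altruistic part to be modality-invariant.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Gradient reversal layer, as used in adversarial feature learning."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip and scale gradients flowing back into the encoder.
        return -ctx.lamb * grad_output, None


def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)


class Disentangler(nn.Module):
    """Splits one modality's features into egocentric (modality-specific)
    and altruistic (modality-shared) components."""

    def __init__(self, dim, num_modalities):
        super().__init__()
        self.ego = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.alt = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.discriminator = nn.Linear(dim, num_modalities)

    def forward(self, feat, lamb=1.0):
        ego = self.ego(feat)  # modality-specific component
        alt = self.alt(feat)  # modality-shared component
        # Reversed gradients make `alt` hard to classify by modality.
        modality_logits = self.discriminator(grad_reverse(alt, lamb))
        return ego, alt, modality_logits


if __name__ == "__main__":
    # Usage: disentangle a batch of 256-d visual features (modality id 0).
    model = Disentangler(dim=256, num_modalities=2)
    visual = torch.randn(8, 256)
    ego_v, alt_v, logits_v = model(visual)
    # Training would add cross-entropy on `logits_v` against the modality id,
    # plus, e.g., an orthogonality penalty between `ego_v` and `alt_v`.
```

The disentangled components would then feed the fusion stage; the hybrid co-attention mechanism and emotion-guided decoder described above are omitted from this sketch.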