Yinggang Xie, Nannan Zhou, Shijuan Zhu
Multimodal emotion recognition presents two major challenges: the limited capacity to model higher-order interactions among modalities, and the difficulty of achieving effective fusion when data quality is imbalanced across modalities. To address these issues, this paper proposes a novel model based on hierarchical feature fusion. The model adopts a three-level fusion framework. First, it integrates static fusion with a dynamic weighting mechanism informed by Bayesian uncertainty estimation to achieve initial alignment and importance modeling of modality-specific features. Second, a multi-head cross-modal attention mechanism is introduced to capture contextual dependencies and complementary information across modalities. Finally, gated recurrent units are employed to model temporal dynamics, thereby enriching the semantic-level fusion representation. Experimental results demonstrate that the proposed method achieves 84.6% accuracy on the binary classification task of the MOSEI dataset and a weighted F1 score of 69.7% on the IEMOCAP dataset, a 2.1% improvement over the representative baseline COGMEN. Ablation studies further confirm that the multi-head attention mechanism, the dynamic weighting strategy, and the gated fusion module each contribute substantially to the overall performance gains.
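The three-level framework described above can be made concrete with a minimal sketch. The following PyTorch module is an illustrative assumption, not the authors' released code: the HierarchicalFusion class, its layer sizes, the per-modality log-variance head standing in for Bayesian uncertainty estimation, and the sigmoid gate are hypothetical choices used only to show how the three stages could compose.

```python
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    """Sketch of the three-level fusion outlined in the abstract.

    Stage 1: uncertainty-informed dynamic weighting of modality features.
    Stage 2: multi-head cross-modal attention.
    Stage 3: GRU temporal modeling with a gated fusion of its output.
    All dimensions and the variance head are illustrative assumptions.
    """
    def __init__(self, dim=128, n_modalities=3, n_heads=4):
        super().__init__()
        # Per-modality head predicting a log-variance, used here as a
        # simple proxy for Bayesian uncertainty estimation.
        self.log_var = nn.ModuleList(
            [nn.Linear(dim, 1) for _ in range(n_modalities)]
        )
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, feats):
        # feats: list of per-modality tensors, each (batch, seq_len, dim),
        # assumed to be pre-aligned to a common length and dimension.
        # Stage 1: weight each modality by its estimated certainty
        # (lower predicted variance -> larger softmax weight).
        weights = torch.softmax(
            torch.cat([-lv(f).mean(1) for lv, f in zip(self.log_var, feats)],
                      dim=-1),
            dim=-1,
        )                                            # (batch, n_modalities)
        x = torch.stack(feats, dim=1)                # (batch, M, seq, dim)
        x = (weights[:, :, None, None] * x).sum(1)   # (batch, seq, dim)

        # Stage 2: cross-modal attention, with the concatenated modality
        # sequences as the key/value memory.
        memory = torch.cat(feats, dim=1)             # (batch, M*seq, dim)
        attn_out, _ = self.cross_attn(x, memory, memory)

        # Stage 3: GRU over the attended sequence, then a sigmoid gate
        # blends the recurrent and attention representations.
        seq_out, _ = self.gru(attn_out)
        g = torch.sigmoid(self.gate(torch.cat([attn_out, seq_out], dim=-1)))
        return g * seq_out + (1 - g) * attn_out
```

Under these assumptions, the module would be called with one feature tensor per modality (e.g. text, audio, video encodings of equal length) and would return a fused sequence representation for a downstream classifier head.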