Yanxia GUO, Yong JIN, Hong TANG, Jinzhi PENG
To prevent emotionally salient information in an utterance from being obscured by irrelevant information and to enable multi-modal information interaction, a multi-modal emotion recognition model based on dynamic convolution and residual gating is proposed; it mines high-level local features and designs an effective interaction fusion strategy. Low-level features, high-level local features, and contextual dependencies are extracted from the text, audio, and visual modalities. Cross-modal dynamic convolution is used to model both inter-modal and intra-modal interactions, simulating interactions between long temporal sequences and capturing the interaction features of the different modalities. A residual gated fusion method then fuses the interaction representations of the different modalities, automatically learning the weight of each interaction feature's influence on the final output, and the resulting multi-modal fusion feature is fed into a classifier for emotion prediction. Experimental results show that the model prevents important emotional-cue information from being obscured by irrelevant information in multi-modal data, achieving sentiment classification accuracies of 83.5% and 83.9% on the CMU-MOSEI and IEMOCAP datasets, respectively, and outperforming benchmark models such as the Multi-modal Transformer (MulT) and Multi-Fusion Residual Memory (MFRM).
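The residual gated fusion step can be illustrated with a minimal sketch, assuming a PyTorch implementation; the module name ResidualGatedFusion, the sigmoid gating form, and the mean residual path are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn


class ResidualGatedFusion(nn.Module):
    """Gated fusion of per-modality interaction features with a residual path,
    so each feature's contribution to the fused output is learned automatically."""

    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        # One sigmoid gate per modality; its output weights that modality's feature.
        self.gates = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid()) for _ in range(num_modalities)]
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, features):
        # features: list of (batch, dim) interaction representations,
        # e.g. [text, audio, vision] after cross-modal dynamic convolution.
        gated = [gate(f) * f for gate, f in zip(self.gates, features)]
        fused = torch.stack(gated, dim=0).sum(dim=0)
        # Residual path keeps the un-gated information available to the classifier.
        residual = torch.stack(features, dim=0).mean(dim=0)
        return self.proj(fused) + residual


# Example: fuse three 128-dimensional interaction features and feed a classifier head.
text_f, audio_f, vision_f = (torch.randn(8, 128) for _ in range(3))
fusion = ResidualGatedFusion(dim=128)
classifier = nn.Linear(128, 2)  # binary sentiment head (hypothetical)
logits = classifier(fusion([text_f, audio_f, vision_f]))  # shape: (8, 2)
```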