Whereas traditional emotion analysis considers text alone, multimodal emotion analysis extends the task to the level of text, image, sound, and so on. To exploit the contextual interaction information expressed in each modality, this paper proposes a multimodal interactive emotion classification model based on video context. An ALBERT-BiGRU network is built for text feature learning, and independent BiGRU models extract contextual features from the text, audio, and video modalities. The features of the three modalities are fused with an attention mechanism before the emotion classification task is performed. Compared with existing models on the MOSI and IEMOCAP datasets, the accuracy and F1 score of emotion classification reached 81.71% and 81.44% on MOSI, and 66.97% and 67.20% on IEMOCAP, exceeding the best benchmark values by 1.41% and 2.16% respectively, effectively improving the accuracy of multimodal emotion prediction.
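The attention-based fusion of the three modality features can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function and variable names, the shared scoring vector `w`, and the dot-product scoring scheme are all assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_fuse(text_feat, audio_feat, video_feat, w):
    """Hypothetical attention fusion: score each modality feature
    vector against a learned vector w, normalize the scores with
    softmax, and return the attention-weighted sum of the features."""
    feats = np.stack([text_feat, audio_feat, video_feat])  # shape (3, d)
    scores = feats @ w                                     # one score per modality
    alpha = softmax(scores)                                # attention weights, sum to 1
    return alpha @ feats                                   # fused feature, shape (d,)

# Toy usage with random feature vectors of dimension d = 8.
rng = np.random.default_rng(0)
d = 8
fused = attention_fuse(rng.normal(size=d), rng.normal(size=d),
                       rng.normal(size=d), rng.normal(size=d))
print(fused.shape)  # fused vector has the same dimension as each modality
```

In the full model, the per-modality feature vectors would come from the ALBERT-BiGRU text encoder and the audio/video BiGRU encoders, and the fused vector would feed the emotion classifier.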
Yingchao Tang, Puzhao Hu, Lanfang Dong, Meng Mao, Guoming Li, Linxiang Tan
Zuhe Li, Qingbing Guo, Chengyao Feng, Lujuan Deng, Qiuwen Zhang, Jianwei Zhang, Fengqin Wang, Qian Sun
Qi Wang, Haizheng Yu, Yao Wang, Hong Bian