To address the shortcomings of existing emotion recognition algorithms, namely limited emotional information, weak feature representations, and low recognition accuracy, this paper proposes a Transformer-based multimodal fusion emotion recognition algorithm (TMFER), which fuses text, speech, and image information for emotion recognition. To account for the different characteristics of each modality, features are extracted separately: a pre-trained BERT model for text, MFCC features for speech, and a CNN feature extractor for images, allowing deeper features to be explored. To address the problem of poorly combined multimodal features, a feature fusion module is built on the multi-head attention mechanism of the Transformer encoder, which extracts and combines latent feature information from the different modalities in parallel. The fused representation is fed into the classification module for emotion recognition, and a joint supervised loss function based on large-margin learning is designed to mitigate the class imbalance and feature confusion observed in the baseline model. Finally, TMFER is compared experimentally with current algorithms that perform well in emotion recognition on the IEMOCAP and MELD multimodal datasets. The experimental results show that TMFER outperforms the other algorithms on all evaluation metrics.
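As a rough illustration of the fusion stage described above, the following PyTorch sketch projects pre-extracted per-modality features (e.g., BERT embeddings for text, MFCCs for speech, CNN features for images) into a shared space and fuses them with a Transformer encoder's multi-head self-attention before classification. All names, dimensions, and layer counts here are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a Transformer-encoder fusion module, assuming each
# modality has already been encoded into a feature sequence. Dimensions,
# layer counts, and class names are assumptions for illustration only.
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, text_dim=768, audio_dim=40, image_dim=512,
                 d_model=256, nhead=8, num_layers=4, num_classes=4):
        super().__init__()
        # Project each modality's features into a shared embedding space.
        self.text_proj = nn.Linear(text_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.image_proj = nn.Linear(image_dim, d_model)
        # Multi-head self-attention over the concatenated token sequence
        # lets every modality attend to every other one in parallel.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, text_feats, audio_feats, image_feats):
        # Each input: (batch, modality_seq_len, modality_dim).
        tokens = torch.cat([self.text_proj(text_feats),
                            self.audio_proj(audio_feats),
                            self.image_proj(image_feats)], dim=1)
        fused = self.encoder(tokens)   # cross-modal attention
        pooled = fused.mean(dim=1)     # simple average pooling
        return self.classifier(pooled) # emotion logits

# Usage with a dummy batch of 2 samples:
logits = MultimodalFusion()(torch.randn(2, 16, 768),
                            torch.randn(2, 100, 40),
                            torch.randn(2, 1, 512))
```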