Multimodal sentiment analysis combines information from multiple modalities, such as text, image, and audio, to analyze and understand people's emotional states and emotional expressions. By drawing on complementary signals, it can capture sentiment more comprehensively and improve the accuracy of sentiment analysis. However, most previous studies focus on the fusion scheme between modalities while ignoring the emotional knowledge beyond the raw text representation, and thus fail to fully mine the semantic sentiment information contained in the text. To address these problems, an Attention Fusion Network with Crossmodal Emotion Enhancement (AFNCEE) is proposed. First, a Long Short-Term Memory (LSTM) network is used to obtain contextual semantic information within each modality, and a stacked cross-modal Transformer structure fuses the text, audio, and visual features to deepen the fusion hierarchy. Then, the SenticNet knowledge base is used to construct a text sentiment knowledge graph that provides an additional sentiment-enhanced representation. Finally, a feature-based attention fusion module is designed to dynamically adjust the weights of this additional representation and of each modal representation, thereby realizing multimodal fusion.
Sadia Abdulhalim, Muaz Albaghdadi, Moshiur Farazi
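The abstract does not give implementation details, so the following is only a minimal PyTorch sketch of one plausible arrangement of the described pipeline: unimodal LSTM encoders, stacked cross-modal attention blocks, a projected SenticNet-derived sentiment representation, and a feature-based attention fusion that learns a weight for each representation. All module names, dimensions, layer counts, and the specific weighting scheme are assumptions for illustration, not the authors' actual implementation.

```python
# Hypothetical AFNCEE-style sketch (not the authors' code).
# Assumed inputs: each modality is a feature sequence (batch, seq_len, feat_dim);
# the SenticNet-derived sentiment representation is a precomputed vector (batch, senti_dim).
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    """Target modality attends to a source modality (cross-modal Transformer layer)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, dim * 2), nn.ReLU(), nn.Linear(dim * 2, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tgt, src):
        h, _ = self.attn(query=tgt, key=src, value=src)
        tgt = self.norm1(tgt + h)
        return self.norm2(tgt + self.ff(tgt))


class AFNCEESketch(nn.Module):
    def __init__(self, dims, senti_dim, hidden=64, n_layers=2, n_classes=3):
        super().__init__()
        # 1) Unimodal LSTM encoders extract contextual semantics per modality.
        self.lstms = nn.ModuleDict({
            m: nn.LSTM(d, hidden, batch_first=True) for m, d in dims.items()
        })
        # 2) Stacked cross-modal blocks: text attends to audio, then to visual, per layer.
        self.t2a = nn.ModuleList([CrossModalBlock(hidden) for _ in range(n_layers)])
        self.t2v = nn.ModuleList([CrossModalBlock(hidden) for _ in range(n_layers)])
        # 3) Project the SenticNet-based representation into the shared space.
        self.senti_proj = nn.Linear(senti_dim, hidden)
        # 4) Feature-based attention fusion: one score per representation, softmax-normalized.
        self.score = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, text, audio, visual, senti):
        enc = {}
        for name, x in (("text", text), ("audio", audio), ("visual", visual)):
            out, _ = self.lstms[name](x)        # contextual features for this modality
            enc[name] = out
        fused = enc["text"]
        for blk_a, blk_v in zip(self.t2a, self.t2v):
            fused = blk_a(fused, enc["audio"])  # text <- audio
            fused = blk_v(fused, enc["visual"]) # text <- visual
        # Mean-pool sequences and stack all representations: (batch, 4, hidden).
        reps = torch.stack([
            fused.mean(dim=1),
            enc["audio"].mean(dim=1),
            enc["visual"].mean(dim=1),
            self.senti_proj(senti),             # additional sentiment-enhanced representation
        ], dim=1)
        weights = torch.softmax(self.score(reps), dim=1)  # dynamic weight per representation
        pooled = (weights * reps).sum(dim=1)
        return self.classifier(pooled)


# Usage with random tensors; feature dimensions are placeholders.
model = AFNCEESketch(dims={"text": 300, "audio": 74, "visual": 35}, senti_dim=16)
logits = model(torch.randn(8, 20, 300), torch.randn(8, 50, 74),
               torch.randn(8, 50, 35), torch.randn(8, 16))
print(logits.shape)  # torch.Size([8, 3])
```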