With the rapid development of social networks, the massive growth of multimodal data such as images and text has raised people's demands for information processing from an emotional perspective. Emotion recognition requires a computer to simulate high-level visual perception and understanding. However, existing methods often focus on a single modality. In this work, we propose a multimodal model based on factorized bilinear pooling (FBP) and adversarial learning for emotion recognition. In our model, a multimodal feature fusion network encodes inter-modality features under the guidance of FBP, so that the visual and textual representations learn from each other interactively. Beyond that, we propose an adversarial network that introduces two discriminative classification tasks: emotion recognition and multimodal fusion prediction. The entire method can be trained end-to-end within a deep neural network framework. Experimental results indicate that our proposed model achieves competitive performance on the extended FI dataset, and further results demonstrate its ability for emotion recognition against other single- and multi-modality works.
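As a minimal sketch of what an FBP fusion layer of this kind might look like (the module name, dimensions, and PyTorch framing are illustrative assumptions, not the paper's actual implementation): each modality vector is projected into a shared low-rank space, the projections are multiplied element-wise to capture bilinear interactions, and the result is sum-pooled and normalized.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedBilinearPooling(nn.Module):
    """Sketch of FBP fusion for a visual and a textual feature vector.

    Assumed hyperparameters (not from the paper): factor_dim output
    units, each sum-pooled over k low-rank factors.
    """
    def __init__(self, visual_dim, text_dim, factor_dim, k=5):
        super().__init__()
        self.k = k
        # Project each modality into a shared (factor_dim * k) space.
        self.proj_v = nn.Linear(visual_dim, factor_dim * k)
        self.proj_t = nn.Linear(text_dim, factor_dim * k)

    def forward(self, v, t):
        # Element-wise product of the projections approximates a full
        # bilinear interaction at low rank.
        joint = self.proj_v(v) * self.proj_t(t)           # (B, factor_dim * k)
        joint = joint.view(-1, joint.size(1) // self.k, self.k).sum(dim=2)
        # Signed square-root and l2 normalization, as is standard for FBP.
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-8)
        return F.normalize(joint, dim=1)

# Usage: fuse a 2048-d visual vector with a 300-d text vector.
fbp = FactorizedBilinearPooling(visual_dim=2048, text_dim=300, factor_dim=512)
fused = fbp(torch.randn(4, 2048), torch.randn(4, 300))    # shape: (4, 512)
```

The fused vector could then feed both the emotion classifier and the adversarial branch described above; how those heads are wired is specific to the paper and not reproduced here.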