Visual Question Answering (VQA) aims to reason out correct answers from input questions and images. Significant progress has been made by learning rich embedding features from images and questions with bilinear models. Attention mechanisms are widely used to focus on specific visual and textual information in the VQA reasoning process. However, most state-of-the-art methods concentrate on fusing global multi-modal features while neglecting local features. Moreover, general visual attention reduces the dimension excessively (from K×2048 to 2048), which causes a substantial loss of visual information. In this paper, we propose a novel multi-channel co-attention network (MC-CAN), which integrates multi-modal features from the global level to the local level. We design separate multi-channel attention mechanisms for visual features (from K×2048 to M×2048) and textual features at different levels of integration. Additionally, we further improve our approach by combining it with complementary modules such as the MLB and Count modules. Experiments on benchmark datasets show that our approach achieves better VQA performance than other state-of-the-art methods.
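As a rough illustration of the K×2048 to M×2048 idea, the sketch below computes M question-guided attention maps over K region features and keeps M attended visual vectors instead of collapsing them into a single 2048-d vector. This is a minimal sketch, not the paper's exact architecture: the module name, hidden sizes, and the additive scoring function are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class MultiChannelVisualAttention(nn.Module):
    """Sketch of multi-channel visual attention: M attention maps over K
    region features yield M attended vectors (M x 2048) rather than one."""

    def __init__(self, v_dim=2048, q_dim=1024, hidden=512, num_channels=4):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, hidden)
        self.q_proj = nn.Linear(q_dim, hidden)
        # One attention channel per output glimpse (assumed scoring scheme).
        self.att = nn.Linear(hidden, num_channels)

    def forward(self, v, q):
        # v: (batch, K, v_dim) region features; q: (batch, q_dim) question feature
        joint = torch.tanh(self.v_proj(v) + self.q_proj(q).unsqueeze(1))  # (batch, K, hidden)
        logits = self.att(joint)                 # (batch, K, M)
        weights = torch.softmax(logits, dim=1)   # attention over the K regions, per channel
        # Weighted sums: (batch, M, K) x (batch, K, v_dim) -> (batch, M, v_dim)
        attended = torch.bmm(weights.transpose(1, 2), v)
        return attended

# Example: K = 36 region features, M = 4 channels
v = torch.randn(8, 36, 2048)
q = torch.randn(8, 1024)
out = MultiChannelVisualAttention(num_channels=4)(v, q)
print(out.shape)  # torch.Size([8, 4, 2048])
```

The point of the multi-channel output is simply that downstream fusion can still see M distinct attended views of the image, rather than a single pooled vector.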