Shenxiang Xiang, Qiaohong Chen, Xian Fang, Meng-Hao Guo
Visual question answering (VQA) is a challenging multimodal task that requires answering natural-language questions about images, and therefore demands a fine-grained understanding of both the visual content of the image and the textual content of the question. However, most existing models are weak at filtering out noisy information and cannot effectively fuse features from multiple modalities. To address these limitations, we propose a novel multimodal gate fusion network (MGFN), which consists of an attention-on-attention interaction module (AoAIM) and a multimodal gate fusion module (MGFM). The AoAIM captures intra-modal and inter-modal dependencies and filters out irrelevant attention, while the MGFM fuses textual and visual features according to the relative importance of the two modalities. Extensive ablation experiments on the VQA-v2 dataset validate the effectiveness of AoAIM and MGFM, showing that both modules play a key role in improving the model's performance. With these two modules embedded, MGFN outperforms the previous state-of-the-art (SOTA) models on VQA-v2, achieving an overall accuracy of 71.68% on the test-dev set and 72.12% on the test-std set.
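The abstract does not include code, but the gated-fusion idea it describes can be illustrated with a minimal sketch. The snippet below shows a generic sigmoid-gated combination of pooled visual and textual features; the class name `GatedFusion`, the 512-dimensional feature size, and the projection layers are illustrative assumptions, not the authors' MGFM implementation.

```python
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Illustrative sketch of gate-based multimodal fusion (not the authors' MGFM)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # The gate estimates the relative importance of the two modalities.
        self.gate = nn.Linear(2 * dim, dim)
        self.proj_visual = nn.Linear(dim, dim)
        self.proj_textual = nn.Linear(dim, dim)

    def forward(self, visual: torch.Tensor, textual: torch.Tensor) -> torch.Tensor:
        # visual, textual: (batch, dim) pooled features from each modality.
        g = torch.sigmoid(self.gate(torch.cat([visual, textual], dim=-1)))
        # Convex combination controlled by the learned gate.
        return g * self.proj_visual(visual) + (1.0 - g) * self.proj_textual(textual)


if __name__ == "__main__":
    fusion = GatedFusion(dim=512)
    v = torch.randn(8, 512)   # e.g. pooled image-region features
    t = torch.randn(8, 512)   # e.g. pooled question features
    print(fusion(v, t).shape)  # torch.Size([8, 512])
```

A per-dimension sigmoid gate like this lets the model lean on the question or the image as needed for each example, which is the intuition behind weighting modalities by their relative importance.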