JOURNAL ARTICLE

Improving Visual Question Answering by Multimodal Gate Fusion Network

Abstract

Visual question answering (VQA) is a challenging multimodal task in which a model must answer questions about images. It requires a fine-grained understanding of both the visual content of the image and the textual content of the question. However, most existing models are weak at filtering out noisy information and cannot fuse features from multiple modalities effectively. To address these limitations, we propose a novel multimodal gate fusion network (MGFN), which consists of an attention-on-attention interaction module (AoAIM) and a multimodal gate fusion module (MGFM). AoAIM captures intra-modal and inter-modal dependencies and filters out some irrelevant attention. MGFM fuses textual and visual features according to the relative importance of the two modalities. We performed extensive ablation experiments on the VQA-v2 dataset to validate the effectiveness of AoAIM and MGFM; they show that both modules play a key role in improving the performance of the model. With these two modules embedded, MGFN outperforms the previous state-of-the-art (SOTA) model on the VQA-v2 dataset, achieving an overall accuracy of 71.68% on the test-dev set and 72.12% on the test-std set.
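
The abstract does not give the equations behind MGFM, so the following is only a minimal sketch of the general gated-fusion idea it describes, not the authors' implementation. It assumes that both modalities have already been projected to vectors of the same length, that a sigmoid gate is computed from their concatenation, and that the gate mixes the two vectors element-wise. The function name `gate_fusion` and the parameters `weights` and `bias` are hypothetical and chosen for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gate_fusion(visual, textual, weights, bias):
    """Fuse two equal-length feature vectors with an element-wise gate.

    Assumed formulation (not taken from the paper):
        g     = sigmoid(W [visual; textual] + b)
        fused = g * visual + (1 - g) * textual

    weights: d rows, each of length 2 * d (hypothetical learned parameters)
    bias:    d values (hypothetical learned parameters)
    Returns (fused, gates), both of length d.
    """
    d = len(visual)
    if len(textual) != d:
        raise ValueError("visual and textual vectors must have equal length")
    if len(weights) != d or len(bias) != d:
        raise ValueError("weights and bias must each have d entries")

    concat = list(visual) + list(textual)
    fused, gates = [], []
    for i in range(d):
        if len(weights[i]) != 2 * d:
            raise ValueError("each weight row must have length 2 * d")
        z = bias[i] + sum(w * x for w, x in zip(weights[i], concat))
        g = sigmoid(z)  # gate near 1 favours the visual feature
        gates.append(g)
        fused.append(g * visual[i] + (1.0 - g) * textual[i])
    return fused, gates


# With all-zero parameters every gate is 0.5, so the fused vector is the
# element-wise mean of the two inputs.
fused, gates = gate_fusion(
    [1.0, 2.0], [3.0, 4.0], [[0.0] * 4, [0.0] * 4], [0.0, 0.0]
)
print(fused)  # [2.0, 3.0]
```

In this sketch a gate value close to 1 passes the visual feature through and a value close to 0 passes the textual feature, which is one simple way to weight modalities by relative importance; the published module may differ in where the gate is applied and how it is parameterized.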

Keywords:
Computer science; Fuse (electrical); Question answering; Artificial intelligence; Set (abstract data type); Visualization; Filter (signal processing); Test set; Key (lock); Modalities; Embedding; Task (project management); Machine learning; Natural language processing; Pattern recognition (psychology); Computer vision

Metrics

Cited By: 2
FWCI (Field Weighted Citation Impact): 0.36
Refs: 52
Citation Normalized Percentile: 0.53

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Multimodal fusion: advancing medical visual question-answering

Anjali Mudgal, Udbhav Kush, Aditya Kumar, Amir Homayoun Jafari

Journal: Neural Computing and Applications Year: 2024 Vol: 36 (33) Pages: 20949-20962

BOOK-CHAPTER

Multimodal Collaborative Attention Fusion Network for Remote Sensing Visual Question Answering

Ke Hu, Wenzhen Zhang, Shichao Zhang

Communications in Computer and Information Science Year: 2025 Pages: 310-322

JOURNAL ARTICLE

Question-Driven Graph Fusion Network for Visual Question Answering

Yuxi Qian, Yuncong Hu, Ruonan Wang, Fangxiang Feng, Xiaojie Wang

Journal: 2022 IEEE International Conference on Multimedia and Expo (ICME) Year: 2022 Pages: 1-6