Shenxiang Xiang, Qiaohong Chen, Xian Fang, Meng-Hao Guo
Visual question answering (VQA) is a challenging multimodal task that requires answering natural-language questions about images, and therefore demands a fine-grained understanding of both the visual content of the image and the textual content of the question. However, most existing models are weak at filtering out noisy information and cannot effectively fuse features from multiple modalities. To address these limitations, we propose a novel multimodal gate fusion network (MGFN), which consists of an attention-on-attention interaction module (AoAIM) and a multimodal gate fusion module (MGFM). The AoAIM captures intra-modal and inter-modal dependencies and filters out irrelevant attention, while the MGFM fuses textual and visual features according to the relative importance of the two modalities. Extensive ablation experiments on the VQA-v2 dataset validate the effectiveness of AoAIM and MGFM, showing that both modules play a key role in improving the model's performance. With these two modules embedded, MGFN outperforms the previous state-of-the-art (SOTA) models on VQA-v2, achieving an overall accuracy of 71.68% on the test-dev set and 72.12% on the test-std set.
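The abstract does not include code, but the gated-fusion idea it describes can be illustrated with a minimal sketch. The snippet below shows a generic sigmoid-gated combination of pooled visual and textual features; the class name `GatedFusion`, the 512-dimensional feature size, and the projection layers are illustrative assumptions, not the authors' MGFM implementation.

```python
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Illustrative sketch of gate-based multimodal fusion (not the authors' MGFM)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # The gate estimates the relative importance of the two modalities.
        self.gate = nn.Linear(2 * dim, dim)
        self.proj_visual = nn.Linear(dim, dim)
        self.proj_textual = nn.Linear(dim, dim)

    def forward(self, visual: torch.Tensor, textual: torch.Tensor) -> torch.Tensor:
        # visual, textual: (batch, dim) pooled features from each modality.
        g = torch.sigmoid(self.gate(torch.cat([visual, textual], dim=-1)))
        # Convex combination controlled by the learned gate.
        return g * self.proj_visual(visual) + (1.0 - g) * self.proj_textual(textual)


if __name__ == "__main__":
    fusion = GatedFusion(dim=512)
    v = torch.randn(8, 512)   # e.g. pooled image-region features
    t = torch.randn(8, 512)   # e.g. pooled question features
    print(fusion(v, t).shape)  # torch.Size([8, 512])
```

A per-dimension sigmoid gate like this lets the model lean on the question or the image as needed for each example, which is the intuition behind weighting modalities by their relative importance.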