JOURNAL ARTICLE

Multi-Module Co-Attention Model for Visual Question Answering

Abstract

Visual Question Answering (VQA) is a typical multi-modal problem spanning computer vision and natural language processing. Most existing VQA models ignore the dynamic relationships of semantic information between the two modalities and the rich spatial structure of an image. To address this, the paper proposes a novel multi-module Co-Attention Network, named MMCAN, which can fully capture the dynamic interactions between objects and the contextual text representation in a visual scene. Relations between different types of objects are modeled with a graph attention mechanism. An adaptive relation representation of the question is learned, and the visual object relations together with the question features are encoded through co-attention to strengthen the dependencies between words and their corresponding image regions. Finally, an enhancement module is used to improve the fitting ability of the model. Experimental results on the public VQA 2.0 and VQA-CP v2 datasets show that the accuracy of the proposed model is significantly better than that of DA-NTN, ReGAT, and ODA-GCN on the "total", "yes/no", "count", and "other" question categories, demonstrating that it can effectively improve the accuracy of visual question answering.
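The abstract names two mechanisms worth making concrete: graph attention that models relations between detected objects conditioned on the question, and co-attention that couples question words with image regions. The sketch below is a minimal, hypothetical PyTorch rendering of those two ideas under common VQA conventions (a fixed set of region features, a pooled question vector); every class and variable name here is an illustrative assumption, not the authors' MMCAN implementation.

```python
# Hypothetical sketch of question-guided graph attention plus co-attention,
# assumed from the abstract's description; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuestionGuidedGraphAttention(nn.Module):
    """Graph attention over object features, with pairwise edge scores
    conditioned on a pooled question vector (assumed design)."""

    def __init__(self, dim):
        super().__init__()
        self.w_node = nn.Linear(dim, dim)
        self.w_pair = nn.Linear(3 * dim, 1)  # scores [obj_i, obj_j, question]

    def forward(self, objects, question):
        # objects: (B, N, D) region features; question: (B, D) pooled vector
        B, N, D = objects.shape
        h = self.w_node(objects)
        q = question.unsqueeze(1).unsqueeze(2).expand(B, N, N, D)
        pairs = torch.cat(
            [h.unsqueeze(2).expand(B, N, N, D),   # object i
             h.unsqueeze(1).expand(B, N, N, D),   # object j
             q], dim=-1)
        logits = self.w_pair(pairs).squeeze(-1)   # (B, N, N) relation logits
        attn = F.softmax(logits, dim=-1)          # question-aware edge weights
        return attn @ h                           # relation-aware object features


class CoAttention(nn.Module):
    """Cross-modal attention: words attend to relation-aware regions and
    vice versa (a common co-attention formulation, assumed here)."""

    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** -0.5

    def forward(self, words, regions):
        # words: (B, T, D); regions: (B, N, D)
        affinity = words @ regions.transpose(1, 2) * self.scale   # (B, T, N)
        word_ctx = F.softmax(affinity, dim=-1) @ regions          # words -> regions
        region_ctx = F.softmax(affinity.transpose(1, 2), dim=-1) @ words
        return word_ctx, region_ctx


if __name__ == "__main__":
    B, N, T, D = 2, 36, 14, 512                 # 36 regions, 14 words (assumed)
    gat = QuestionGuidedGraphAttention(D)
    coatt = CoAttention(D)
    objs = torch.randn(B, N, D)
    words = torch.randn(B, T, D)
    q_vec = words.mean(dim=1)                   # crude pooled question vector
    rel_objs = gat(objs, q_vec)                 # (2, 36, 512)
    w_ctx, r_ctx = coatt(words, rel_objs)       # (2, 14, 512), (2, 36, 512)
    print(w_ctx.shape, r_ctx.shape)
```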

Keywords:
Question answering; Representation; Set; Relation; Object; Spatial relation; Natural language; Visualization

Metrics

Cited by: 1
FWCI (Field-Weighted Citation Impact): 0.12
References: 0
Citation Normalized Percentile: 0.44


Topics

Multimodal Machine Learning Applications (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Topic Modeling (Physical Sciences → Computer Science → Artificial Intelligence)
Advanced Technologies in Various Fields (Physical Sciences → Computer Science → Artificial Intelligence)