JOURNAL ARTICLE

Co-Attention Network With Question Type for Visual Question Answering

Chao Yang, Mengqi Jiang, Bin Jiang, Weixin Zhou, Keqin Li

Year: 2019; Journal: IEEE Access; Vol: 7; Pages: 40771-40781; Publisher: Institute of Electrical and Electronics Engineers

Abstract

Visual Question Answering (VQA) is a challenging multi-modal learning task because it requires understanding the visual and textual modalities simultaneously. How the images and questions are represented in a fine-grained manner therefore plays a key role in performance. To obtain fine-grained image and question representations, we develop a co-attention mechanism within an end-to-end deep network architecture that jointly learns the image and question features. Specifically, textual attention, implemented by a self-attention model, reduces unrelated information and extracts more discriminative features for question-level representations, which are in turn used to guide visual attention. We also note that many existing works use complex models to extract feature representations but neglect high-level summary information, such as question types, during learning. Hence, we introduce the question type into our model by directly concatenating it with the multi-modal joint representation to narrow down the candidate answer space. A new network architecture combining the proposed co-attention mechanism and the question type provides a unified model for VQA. Extensive experiments on two public datasets demonstrate the effectiveness of our model compared with several state-of-the-art approaches.
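The data flow described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give layer shapes, scoring functions, or the fusion operator, so the single-layer attention scorers, the concatenation used as the joint representation, the one-hot question-type encoding, and every name and dimension below are assumptions chosen only to show the three steps (textual self-attention, question-guided visual attention, question-type concatenation).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def textual_self_attention(words, w_t):
    # words: (T, d) word features; w_t: (d,) scoring vector (assumed form).
    alpha = softmax(words @ w_t)          # (T,) weight per word
    return alpha @ words, alpha           # question-level feature (d,)

def question_guided_visual_attention(regions, q, W_v, W_q, w_a):
    # regions: (R, d_v) image region features; q: (d,) attended question.
    # The question feature is broadcast across all R regions.
    hidden = np.tanh(regions @ W_v + q @ W_q)   # (R, h)
    beta = softmax(hidden @ w_a)                # (R,) weight per region
    return beta @ regions, beta                 # attended image feature (d_v,)

def fuse_with_question_type(v, q, type_id, num_types):
    # Assumption: the joint representation is a plain concatenation of v and q,
    # and the question type is appended as a one-hot vector.
    joint = np.concatenate([v, q])
    one_hot = np.zeros(num_types)
    one_hot[type_id] = 1.0
    return np.concatenate([joint, one_hot])

# Toy dimensions and random weights, purely illustrative.
rng = np.random.default_rng(0)
T, R, d, d_v, h, num_types = 6, 36, 8, 16, 10, 3
words = rng.normal(size=(T, d))
regions = rng.normal(size=(R, d_v))
w_t = rng.normal(size=d)
W_v = rng.normal(size=(d_v, h))
W_q = rng.normal(size=(d, h))
w_a = rng.normal(size=h)

q, alpha = textual_self_attention(words, w_t)
v, beta = question_guided_visual_attention(regions, q, W_v, W_q, w_a)
features = fuse_with_question_type(v, q, type_id=1, num_types=num_types)
print(features.shape)  # (27,) = d_v + d + num_types
```

In a full model the resulting vector would feed an answer classifier, and the weights would be learned end to end rather than sampled at random.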

Keywords:
Question answering, Computer science, Discriminative model, Artificial intelligence, Representation (politics), Feature (linguistics), Architecture, Modal, Task (project management), Feature learning, Machine learning, Key (lock), Modalities, Deep learning, Natural language processing

Metrics

Cited By: 51
FWCI (Field Weighted Citation Impact): 3.31
Refs: 69
Citation Normalized Percentile: 0.94

Topics

Multimodal Machine Learning Applications
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences → Computer Science → Artificial Intelligence