Co-Attention Network With Question Type for Visual Question Answering

Chao Yang; Mengqi Jiang; Bin Jiang; Weixin Zhou; Keqin Li

doi:10.1109/access.2019.2908035

JOURNAL ARTICLE

Year: 2019; Journal: IEEE Access; Vol: 7; Pages: 40771-40781; Publisher: Institute of Electrical and Electronics Engineers

Abstract

Visual Question Answering (VQA) is a challenging multi-modal learning task because it requires understanding visual and textual modalities simultaneously. The approaches used to represent images and questions in a fine-grained manner therefore play a key role in performance. To obtain fine-grained image and question representations, we develop a co-attention mechanism within an end-to-end deep network architecture that jointly learns the image and question features. Specifically, textual attention, implemented by a self-attention model, reduces unrelated information and extracts more discriminative features for the question-level representation, which is in turn used to guide visual attention. We also note that many existing works use complex models to extract feature representations but neglect high-level summary information, such as the question type, during learning. Hence, we introduce the question type into our model by directly concatenating it with the multi-modal joint representation to narrow down the candidate answer space. A new network architecture combining the proposed co-attention mechanism and the question type provides a unified model for VQA. Extensive experiments on two public datasets demonstrate the effectiveness of our model compared with several state-of-the-art approaches.
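The data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' architecture: the dot-product scoring, the element-wise product used for fusion, the one-hot question-type encoding, and every name in the code (`co_attention`, `w_text`, and so on) are assumptions made for the example, since the abstract does not specify these details.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Combine vectors using attention weights."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def co_attention(word_feats, region_feats, w_text, qtype_onehot):
    # Textual self-attention: score each word with a scoring vector
    # (w_text is a hypothetical stand-in for learned parameters).
    text_weights = softmax([dot(w_text, h) for h in word_feats])
    question_vec = weighted_sum(text_weights, word_feats)

    # The attended question representation guides visual attention
    # over image regions (dot-product scoring is an assumption).
    visual_weights = softmax([dot(question_vec, r) for r in region_feats])
    image_vec = weighted_sum(visual_weights, region_feats)

    # Multi-modal joint representation (element-wise product is an assumption).
    joint = [q * v for q, v in zip(question_vec, image_vec)]

    # Concatenate the question type directly onto the joint representation.
    fused = joint + list(qtype_onehot)
    return fused, text_weights, visual_weights


# Toy inputs: 3 words and 2 image regions, all with 2-dimensional features.
word_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
region_feats = [[0.5, 0.5], [2.0, 0.0]]
w_text = [0.1, 0.2]
qtype_onehot = [0, 1, 0]  # e.g. one of three hypothetical question types

fused, text_weights, visual_weights = co_attention(
    word_feats, region_feats, w_text, qtype_onehot
)
```

In this sketch, `fused` would be passed to an answer classifier; appending the question type gives that classifier an explicit signal about which part of the answer space is relevant.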

Keywords:

Question answering; Computer science; Discriminative model; Artificial intelligence; Representation (politics); Feature (linguistics); Architecture; Modal; Task (project management); Feature learning; Machine learning; Key (lock); Modalities; Deep learning; Natural language processing

Metrics

FWCI (Field Weighted Citation Impact): 3.31

Citation Normalized Percentile: 0.94

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Dynamic Co-attention Network for Visual Question Answering

Question Type Guided Attention in Visual Question Answering

Co-attention graph convolutional network for visual question answering

Causality guided co-attention network for visual question answering

ViCAN: Co-Attention Network for Vietnamese Visual Question Answering