Chandra Churh Chatterjee, C. Chandra Sekhar
Approaches to Visual Question Answering (VQA) revolve around the fusion mechanism that combines the semantic information extracted from the image and the question. The proposed architecture for open-ended VQA has four major components: (1) a vision encoder, (2) a language encoder, (3) a co-attention module, and (4) an answer generator. We explore different combinations of the vision encoder and the language encoder to obtain representations of the input image and the question. We propose a nonlinear co-attention mechanism and a stacked co-attention mechanism to obtain a combined representation of the image and the question. We also combine the representation of the caption of the input image with the representations of the image and the question in a caption-based stacked nonlinear co-attention mechanism. Results of experimental studies on the VQAv2 dataset demonstrate that the open-ended VQA model using the caption-based stacked nonlinear co-attention module gives improved performance.
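The abstract does not spell out the fusion step, so the following is only an illustrative sketch of a generic co-attention fusion of image-region and question-token features, not the authors' actual mechanism. The affinity matrix, the mean-pooling of attention scores, and the `tanh` fusion are all assumptions made for illustration; dimensions are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(V, Q):
    """Hypothetical co-attention step.

    V: (num_regions, d) image-region features from a vision encoder.
    Q: (num_tokens, d) question-token features from a language encoder.
    Returns a fused (d,) vector combining both modalities.
    """
    C = Q @ V.T                    # token-to-region affinity matrix
    a_v = softmax(C.mean(axis=0))  # attention weights over image regions
    a_q = softmax(C.mean(axis=1))  # attention weights over question tokens
    v_hat = a_v @ V                # attended image representation
    q_hat = a_q @ Q                # attended question representation
    return np.tanh(v_hat + q_hat)  # simple nonlinear fusion (assumed)

rng = np.random.default_rng(0)
V = rng.standard_normal((36, 8))   # e.g. 36 regions, feature dim 8
Q = rng.standard_normal((14, 8))   # e.g. 14 tokens, feature dim 8
z = co_attention(V, Q)
print(z.shape)
```

A "stacked" variant would repeat such a step several times, feeding each fused representation into the next attention layer; a caption-based variant would add a third feature matrix for the caption tokens and fuse it the same way.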