Visual Question Answering (VQA) models employ attention mechanisms to discover image locations that are most relevant for answering a specific question. For this purpose, several multimodal fusion strategies have been proposed, ranging from relatively simple operations (e.g., linear sum) to more complex ones (e.g., Block). The resulting multimodal representations define an intermediate feature space for capturing the interplay between visual and semantic features, which is helpful in selectively focusing on image content. In this paper, we propose a question-agnostic attention mechanism that is complementary to the existing question-dependent attention mechanisms. Our proposed model parses object instances to obtain an 'object map' and applies this map to the visual features to generate Question-Agnostic Attention (QAA) features. In contrast to question-dependent attention approaches, which are learned end-to-end, the proposed QAA does not involve question-specific training and can easily be included in almost any existing VQA model as a generic, lightweight pre-processing step, thereby adding minimal computational overhead to training. Further, when used to complement the question-dependent attention, QAA allows the model to focus on regions containing objects that might have been overlooked by the learned attention representation. Through extensive evaluation on the VQAv1, VQAv2 and TDIUC datasets, we show that incorporating complementary QAA allows state-of-the-art VQA models to perform better, and provides a significant boost to simpler VQA models, enabling them to perform on par with highly sophisticated fusion strategies.
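As a rough illustration of the idea described above (not the authors' implementation), the sketch below assumes that an off-the-shelf instance-segmentation model supplies per-object binary masks. These masks are collapsed into a single 'object map' that re-weights a CNN feature grid to produce question-agnostic attention (QAA) features, which could then be concatenated with the features produced by any existing question-dependent attention module. All function names, tensor shapes, and the pooling/fusion choices here are hypothetical.

```python
import torch
import torch.nn.functional as F

def question_agnostic_attention(visual_feats, object_masks):
    """Hypothetical QAA sketch: re-weight grid visual features with an
    object map built from instance masks (assumed inputs, not the paper's code).

    visual_feats : (B, C, H, W)       CNN feature grid
    object_masks : (B, K, Hi, Wi)     binary instance masks from any
                                      off-the-shelf instance segmenter
    """
    B, C, H, W = visual_feats.shape
    # Collapse the K instance masks into one object map per image.
    object_map = object_masks.float().amax(dim=1, keepdim=True)       # (B, 1, Hi, Wi)
    # Resize the map to the feature-grid resolution.
    object_map = F.interpolate(object_map, size=(H, W), mode="nearest")
    # Apply the map: keep features on object regions, suppress the rest.
    qaa_feats = visual_feats * object_map                              # (B, C, H, W)
    # Pool to a vector so it can be fused with question-dependent features.
    return qaa_feats.flatten(2).mean(dim=-1)                          # (B, C)

# Example fusion with an existing question-dependent attention vector:
# combined = torch.cat([question_dependent_feats, qaa_feats], dim=-1)
```

Because the object map is computed purely from the image, this step needs no question-specific training and can be run once as pre-processing, which is consistent with the lightweight, model-agnostic role the abstract describes for QAA.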