JOURNAL ARTICLE

Question-Agnostic Attention for Visual Question Answering

Abstract

Visual Question Answering (VQA) models employ attention mechanisms to discover image locations that are most relevant for answering a specific question. For this purpose, several multimodal fusion strategies have been proposed, ranging from relatively simple operations (e.g., linear sum) to more complex ones (e.g., Block). The resulting multimodal representations define an intermediate feature space that captures the interplay between visual and semantic features and helps the model focus selectively on image content. In this paper, we propose a question-agnostic attention mechanism that is complementary to the existing question-dependent attention mechanisms. Our proposed model parses object instances to obtain an 'object map' and applies this map to the visual features to generate Question-Agnostic Attention (QAA) features. In contrast to question-dependent attention approaches that are learned end-to-end, the proposed QAA does not involve question-specific training and can be easily included in almost any existing VQA model as a generic lightweight pre-processing step, thereby adding minimal computational overhead for training. Further, when used to complement question-dependent attention, the QAA allows the model to focus on regions containing objects that might have been overlooked by the learned attention representation. Through extensive evaluation on the VQAv1, VQAv2 and TDIUC datasets, we show that incorporating complementary QAA allows state-of-the-art VQA models to perform better and provides a significant boost to simplistic VQA models, enabling them to perform on par with highly sophisticated fusion strategies.
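
The abstract does not say how the object map is applied to the visual features, so the following is only an illustrative sketch under stated assumptions: the map is taken to be a binary grid at the same spatial resolution as the feature grid, it is applied by element-wise multiplication, and the masked features are sum-pooled into a single vector. The names `apply_object_map` and `sum_pool` are hypothetical and do not come from the paper.

```python
from typing import List

Grid = List[List[float]]            # H x W object map (assumed binary: 1 = object, 0 = background)
Features = List[List[List[float]]]  # H x W x C grid of visual feature vectors


def apply_object_map(features: Features, object_map: Grid) -> Features:
    """Scale each spatial feature vector by the object-map value at that cell.

    Assumption: element-wise masking; the paper may combine the two differently.
    """
    if len(features) != len(object_map):
        raise ValueError("features and object_map must have the same height")
    masked = []
    for feat_row, map_row in zip(features, object_map):
        if len(feat_row) != len(map_row):
            raise ValueError("features and object_map must have the same width")
        masked.append([[m * v for v in vec] for vec, m in zip(feat_row, map_row)])
    return masked


def sum_pool(features: Features) -> List[float]:
    """Sum the feature vectors over all spatial cells into one C-dimensional vector."""
    channels = len(features[0][0])
    pooled = [0.0] * channels
    for row in features:
        for vec in row:
            for c, v in enumerate(vec):
                pooled[c] += v
    return pooled


# Toy 2 x 2 grid with 2 channels; the map marks the top-left and
# bottom-right cells as containing objects.
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[5.0, 6.0], [7.0, 8.0]]]
object_map = [[1.0, 0.0],
              [0.0, 1.0]]

qaa_features = apply_object_map(features, object_map)  # question is never consulted
qaa_vector = sum_pool(qaa_features)
```

Because the question plays no part in this computation, features of this kind could be computed once per image as a pre-processing step and then supplied to a VQA model alongside its question-dependent attention output, which is consistent with the low training overhead the abstract describes.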

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence
