Visual Question Answering (VQA) models employ attention mechanisms to discover image locations that are most relevant for answering a specific question. For this purpose, several multimodal fusion strategies have been proposed, ranging from relatively simple operations (e.g., linear sum) to more complex ones (e.g., Block). The resulting multimodal representations define an intermediate feature space for capturing the interplay between visual and semantic features, which is helpful in selectively focusing on image content. In this paper, we propose a question-agnostic attention mechanism that is complementary to the existing question-dependent attention mechanisms. Our proposed model parses object instances to obtain an 'object map' and applies this map to the visual features to generate Question-Agnostic Attention (QAA) features. In contrast to question-dependent attention approaches, which are learned end-to-end, the proposed QAA does not involve question-specific training and can easily be included in almost any existing VQA model as a generic, lightweight pre-processing step, thereby adding minimal computational overhead to training. Further, when used to complement the question-dependent attention, QAA allows the model to focus on regions containing objects that might have been overlooked by the learned attention representation. Through extensive evaluation on the VQAv1, VQAv2 and TDIUC datasets, we show that incorporating complementary QAA allows state-of-the-art VQA models to perform better, and provides a significant boost to simpler VQA models, enabling them to perform on par with highly sophisticated fusion strategies.
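As a rough illustration of the idea described above (not the authors' implementation), the sketch below assumes that an off-the-shelf instance-segmentation model supplies per-object binary masks. These masks are collapsed into a single 'object map' that re-weights a CNN feature grid to produce question-agnostic attention (QAA) features, which could then be concatenated with the features produced by any existing question-dependent attention module. All function names, tensor shapes, and the pooling/fusion choices here are hypothetical.

```python
import torch
import torch.nn.functional as F

def question_agnostic_attention(visual_feats, object_masks):
    """Hypothetical QAA sketch: re-weight grid visual features with an
    object map built from instance masks (assumed inputs, not the paper's code).

    visual_feats : (B, C, H, W)       CNN feature grid
    object_masks : (B, K, Hi, Wi)     binary instance masks from any
                                      off-the-shelf instance segmenter
    """
    B, C, H, W = visual_feats.shape
    # Collapse the K instance masks into one object map per image.
    object_map = object_masks.float().amax(dim=1, keepdim=True)       # (B, 1, Hi, Wi)
    # Resize the map to the feature-grid resolution.
    object_map = F.interpolate(object_map, size=(H, W), mode="nearest")
    # Apply the map: keep features on object regions, suppress the rest.
    qaa_feats = visual_feats * object_map                              # (B, C, H, W)
    # Pool to a vector so it can be fused with question-dependent features.
    return qaa_feats.flatten(2).mean(dim=-1)                          # (B, C)

# Example fusion with an existing question-dependent attention vector:
# combined = torch.cat([question_dependent_feats, qaa_feats], dim=-1)
```

Because the object map is computed purely from the image, this step needs no question-specific training and can be run once as pre-processing, which is consistent with the lightweight, model-agnostic role the abstract describes for QAA.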