A number of recent works have proposed attention models for Visual Question Answering (VQA) that generate spatial maps highlighting image regions relevant to answering the question. In this paper, we argue that in addition to modeling "where to look" or visual attention, it is equally important to model "what words to listen to" or question attention. We present a novel co-attention model for VQA that jointly reasons about image and question attention. In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolutional neural network (CNN). Our model improves the state of the art on the VQA dataset from 60.3% to 60.5%, and from 61.6% to 63.3% on the COCO-QA dataset. By using ResNet, the performance is further improved to 62.1% for VQA and 65.4% for COCO-QA.
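To make the co-attention idea concrete, below is a minimal PyTorch sketch of a parallel co-attention layer that jointly produces attention weights over image regions and question words via an affinity matrix C = tanh(Qᵀ W_b V). The affinity formulation, dimensions (d, k), parameter names, and initialization are illustrative assumptions for exposition, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelCoAttention(nn.Module):
    """Sketch of a parallel co-attention layer (assumed formulation).

    Couples image features V (d x N regions) and question features
    Q (d x T words) through an affinity matrix C = tanh(Q^T W_b V),
    then derives attention weights over regions and words jointly.
    """
    def __init__(self, d: int, k: int):
        super().__init__()
        self.W_b = nn.Parameter(torch.randn(d, d) * 0.01)   # affinity weights
        self.W_v = nn.Parameter(torch.randn(k, d) * 0.01)   # image projection
        self.W_q = nn.Parameter(torch.randn(k, d) * 0.01)   # question projection
        self.w_hv = nn.Parameter(torch.randn(k) * 0.01)     # region-attention scorer
        self.w_hq = nn.Parameter(torch.randn(k) * 0.01)     # word-attention scorer

    def forward(self, V: torch.Tensor, Q: torch.Tensor):
        # V: (batch, d, N) image region features; Q: (batch, d, T) word features
        C = torch.tanh(Q.transpose(1, 2) @ self.W_b @ V)                      # (batch, T, N)
        H_v = torch.tanh(self.W_v @ V + (self.W_q @ Q) @ C)                   # (batch, k, N)
        H_q = torch.tanh(self.W_q @ Q + (self.W_v @ V) @ C.transpose(1, 2))   # (batch, k, T)
        a_v = F.softmax(self.w_hv @ H_v, dim=-1)   # attention over image regions, (batch, N)
        a_q = F.softmax(self.w_hq @ H_q, dim=-1)   # attention over question words, (batch, T)
        v_hat = (V * a_v.unsqueeze(1)).sum(dim=-1)  # attended image feature, (batch, d)
        q_hat = (Q * a_q.unsqueeze(1)).sum(dim=-1)  # attended question feature, (batch, d)
        return v_hat, q_hat, a_v, a_q

# Example usage with hypothetical sizes: 512-dim features, 196 regions, 20 words
layer = ParallelCoAttention(d=512, k=256)
v_hat, q_hat, a_v, a_q = layer(torch.randn(2, 512, 196), torch.randn(2, 512, 20))
```

In the hierarchical variant the abstract describes, a layer like this would be applied at each level of the question representation (word, phrase, and sentence), with the phrase level built by 1-D convolutions over word embeddings.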