In this paper we propose to address the problem of Visual Question Answering with a novel segmentation-guided, attention-based network, which we call SegAttend-Net. We use image segmentation maps generated by a fully convolutional deep neural network to refine our attention maps, and we use these refined attention maps to make the model focus on the parts of the image relevant to answering a question. The refined attention maps are then used by an LSTM network to learn to produce the answer. We currently train our model on the Visual7W dataset and perform a category-wise evaluation over its 7 question categories. We achieve state-of-the-art results on this dataset, improving question answering accuracy over the previous benchmark by 1.5 percentage points, from 54.1% to 55.6%, and we demonstrate improvements in each of the question categories. We also visualize our generated attention maps and note their improvement over those generated by the previous best approach.
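To make the idea of refining an attention map with a segmentation map concrete, the following is a minimal sketch of one plausible reading, not the SegAttend-Net formulation itself, which the abstract does not specify. It assumes a flattened attention map over image regions and a segmentation mask with values in [0, 1]; the function name `refine_attention` and the `floor` parameter are hypothetical and introduced only for illustration.

```python
def refine_attention(attention, seg_mask, floor=0.1):
    """Reweight a flattened attention map by a segmentation mask, then renormalize.

    attention: list of non-negative weights over image regions.
    seg_mask:  list of values in [0, 1]; 1 marks a region inside a segmented object.
    floor:     hypothetical fraction of weight that background regions retain.
    """
    if len(attention) != len(seg_mask):
        raise ValueError("attention and seg_mask must have the same length")
    # Object regions keep their full weight; background regions keep only `floor` of it.
    weighted = [a * (floor + (1.0 - floor) * m) for a, m in zip(attention, seg_mask)]
    total = sum(weighted)
    if total == 0.0:
        # Nothing to renormalize; fall back to the unrefined attention map.
        return list(attention)
    # Renormalize so the refined weights again sum to 1.
    return [w / total for w in weighted]


# Illustrative input: uniform attention over four regions, two of them segmented.
attention = [0.25, 0.25, 0.25, 0.25]
seg_mask = [1.0, 0.0, 0.0, 1.0]
refined = refine_attention(attention, seg_mask)
```

In this sketch the refined weights shift toward the segmented regions while still summing to 1, so a downstream answer module would receive a distribution concentrated on object regions. Whether the actual model combines the two maps multiplicatively is an assumption here.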