ZOU Pinrong, XIAO Feng, ZHANG Wenjuan, ZHANG Wanyu, WANG Chenyang
Visual Question Answering (VQA) is a typical multi-modal problem at the intersection of computer vision and natural language processing. Most existing VQA models ignore the dynamic semantic relationships between the two modalities and the rich spatial structure of the image. To address this, the paper proposes a novel Multi-Module Co-Attention Network, named MMCAN, which fully captures the dynamic interactions between objects and the contextual text representation in a visual scene. Relations between different types of objects are modeled with a graph attention mechanism, an adaptive question-conditioned relation representation is learned, and the visual object relations and question features are encoded through co-attention to strengthen the dependence between words and their corresponding image regions. Finally, an enhancement module improves the fitting ability of the model. Experimental results on the public benchmarks VQA 2.0 and VQA-CP v2 show that the proposed model is significantly more accurate than DA-NTN, ReGAT, and ODA-GCN on the "overall", "yes/no", "number", and "other" question categories, effectively improving the accuracy of visual question answering.
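As a rough illustration of the two core components named in the abstract, the following sketch combines a single-head graph attention layer over visual object features with a question-guided co-attention step. It is a minimal sketch under stated assumptions: all module names (GraphAttention, CoAttention), tensor shapes, and the element-wise fusion at the end are illustrative choices, not the authors' MMCAN implementation, and the enhancement module is omitted.

```python
# Illustrative sketch only: single-head graph attention over object features
# plus question-guided co-attention. Names, dimensions, and fusion are
# assumptions; this is not the paper's MMCAN architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttention(nn.Module):
    """GAT-style attention over K object features to model object relations."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.attn = nn.Linear(2 * dim, 1)

    def forward(self, v):                       # v: (B, K, D) object features
        h = self.proj(v)                        # (B, K, D)
        B, K, D = h.shape
        hi = h.unsqueeze(2).expand(B, K, K, D)  # pairwise feature concatenation
        hj = h.unsqueeze(1).expand(B, K, K, D)
        e = F.leaky_relu(self.attn(torch.cat([hi, hj], dim=-1))).squeeze(-1)
        a = F.softmax(e, dim=-1)                # (B, K, K) relation weights
        return torch.bmm(a, h)                  # relation-aware object features

class CoAttention(nn.Module):
    """Question-guided attention over relation-aware object features."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, v, q):                    # v: (B, K, D), q: (B, D)
        s = self.score(torch.tanh(v * q.unsqueeze(1)))  # (B, K, 1) scores
        a = F.softmax(s, dim=1)                 # attention over the K objects
        return (a * v).sum(dim=1)               # attended visual feature (B, D)

# Usage: attend over relation-aware objects, then fuse with the question.
B, K, D = 2, 36, 512
v, q = torch.randn(B, K, D), torch.randn(B, D)
fused = CoAttention(D)(GraphAttention(D)(v), q) * q   # (B, D) joint feature
```

The sketch uses a plain element-wise product to fuse the attended visual feature with the question embedding; the paper's actual fusion and answer-prediction head may differ.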