Chandra Churh Chatterjee, C. Chandra Sekhar
Approaches to Visual Question Answering (VQA) revolve around the fusion mechanism that combines the semantic information extracted from the image and the question. The proposed architecture for open-ended VQA has four major components: (1) a vision encoder, (2) a language encoder, (3) a co-attention module, and (4) an answer generator. We explore different combinations of the vision encoder and the language encoder to obtain representations of the input image and the question. We propose a nonlinear co-attention mechanism and a stacked co-attention mechanism to obtain a combined representation of the image and the question. We also combine the representation of the caption of the input image with the representations of the image and the question in a caption-based stacked nonlinear co-attention mechanism. Results of experimental studies on the VQAv2 dataset demonstrate that the open-ended VQA model using the caption-based stacked nonlinear co-attention module gives improved performance.
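The abstract does not spell out the fusion step, so the following is only an illustrative sketch of a generic co-attention fusion of image-region and question-token features, not the authors' actual mechanism. The affinity matrix, the mean-pooling of attention scores, and the `tanh` fusion are all assumptions made for illustration; dimensions are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(V, Q):
    """Hypothetical co-attention step.

    V: (num_regions, d) image-region features from a vision encoder.
    Q: (num_tokens, d) question-token features from a language encoder.
    Returns a fused (d,) vector combining both modalities.
    """
    C = Q @ V.T                    # token-to-region affinity matrix
    a_v = softmax(C.mean(axis=0))  # attention weights over image regions
    a_q = softmax(C.mean(axis=1))  # attention weights over question tokens
    v_hat = a_v @ V                # attended image representation
    q_hat = a_q @ Q                # attended question representation
    return np.tanh(v_hat + q_hat)  # simple nonlinear fusion (assumed)

rng = np.random.default_rng(0)
V = rng.standard_normal((36, 8))   # e.g. 36 regions, feature dim 8
Q = rng.standard_normal((14, 8))   # e.g. 14 tokens, feature dim 8
z = co_attention(V, Q)
print(z.shape)
```

A "stacked" variant would repeat such a step several times, feeding each fused representation into the next attention layer; a caption-based variant would add a third feature matrix for the caption tokens and fuse it the same way.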