Aakansha Mishra, Miriyala Srinivas Soumitri, Vikram N Rajendiran
Reasoning conditioned on joint visual and linguistic information has gained immense importance in recent times. The prior art in Visual Question Answering (VQA) has been predominantly connectionist in nature. To address the limitations of connectionist AI models, symbolic models were proposed that allow for explainable visual reasoning. In addition to semantic parsing, such models perform visual parsing to produce scene graphs, enabling accurate reasoning conditioned on these explainable scene graphs. However, real VQA scenarios cannot always be segregated exclusively into connectionist (neural-network) and conceptual modalities; rather, they depend on the relationships and interactions between the two. In this work, the authors propose a question-guided attention mechanism that combines explainable visual reasoning over scene graphs with a cross-modality multi-head attention mechanism. The contributions of the connectionist and conceptual modalities are learned through the semantic parsing of the question in each VQA task. The proposed method is evaluated on the VQA2.0 and GQA datasets, achieving 65.31% and 63.06% accuracy, respectively, surpassing the state-of-the-art in explainable AI.
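The cross-modality multi-head attention described above can be sketched in a minimal NumPy form: question token features act as queries attending over scene-graph node features as keys and values. This is an illustrative sketch only, not the authors' implementation — the projection matrices are randomly initialised stand-ins for learned weights, and all dimensions and function names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(q_feats, g_feats, num_heads=4, seed=0):
    """Question-guided multi-head attention over scene-graph nodes.

    q_feats: (Lq, d) question token features (queries).
    g_feats: (Ln, d) scene-graph node features (keys/values).
    Returns: (Lq, d) question features attended over graph nodes.
    """
    Lq, d = q_feats.shape
    assert d % num_heads == 0, "feature dim must divide evenly across heads"
    dh = d // num_heads
    rng = np.random.default_rng(seed)
    # Random projections stand in for learned weight matrices.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    # Project and split into heads: (num_heads, length, dh).
    Q = (q_feats @ Wq).reshape(Lq, num_heads, dh).transpose(1, 0, 2)
    K = (g_feats @ Wk).reshape(-1, num_heads, dh).transpose(1, 0, 2)
    V = (g_feats @ Wv).reshape(-1, num_heads, dh).transpose(1, 0, 2)
    # Scaled dot-product attention per head: (num_heads, Lq, Ln).
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(dh)
    attn = softmax(scores, axis=-1)
    out = attn @ V  # (num_heads, Lq, dh)
    # Merge heads back into a single feature dimension.
    return out.transpose(1, 0, 2).reshape(Lq, d)
```

In the paper's setting the attended question features would then be combined with the symbolic reasoning branch, with the question's semantic parse weighting the two modalities.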