Multi-Modal Structure-Embedding Graph Transformer for Visual Commonsense Reasoning

Jian Zhu; Hanli Wang; Bin He

doi:10.1109/tmm.2023.3279691

JOURNAL ARTICLE

Year: 2023
Journal: IEEE Transactions on Multimedia
Vol: 26
Pages: 1295-1305
Publisher: Institute of Electrical and Electronics Engineers

Abstract

Visual commonsense reasoning (VCR) is a challenging reasoning task that requires not only answering a question about a given image but also providing a rationale that justifies the choice. Graph-based networks are well suited to representing and extracting the correlations between image and language for reasoning, and a fundamental problem is how to construct and learn graphs from such multi-modal Euclidean data. Most existing graph-based methods treat visual regions and linguistic words as identical graph nodes, ignoring the inherent characteristics of multi-modal data. In addition, these approaches typically have only one graph-learning layer, and their performance declines as the model goes deeper. To address these issues, a novel method named Multi-modal Structure-embedding Graph Transformer (MSGT) is proposed. Specifically, an answer-vision graph and an answer-question graph are constructed to represent and model intra-modal and inter-modal correlations in VCR simultaneously; additional multi-modal structure representations are initialized and embedded according to visual region distances and linguistic word order to give a more reasonable graph representation. Then, a structure-injecting graph transformer is designed to inject the embedded structure priors into the semantic correlation matrix so that both node features and structure representations are updated; this design allows more layers to be stacked, making the model deeper and extracting more powerful features guided by instructive priors. To fuse graph features adaptively, a scored pooling mechanism is further developed to select valuable clues for reasoning from the learnt node features. Experiments demonstrate the superiority of the proposed MSGT framework over state-of-the-art methods on the VCR benchmark dataset.
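
The two mechanisms named in the abstract, injecting a structure prior into a correlation matrix and pooling node features by score, can be illustrated with a minimal sketch. This is not the authors' implementation: the additive form of the bias, the negative-distance prior in `distance_bias`, the dot-product scoring in `scored_pooling`, and all function names are assumptions made only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def distance_bias(positions):
    """Hypothetical structure prior: nearer nodes get a larger (less negative) bias.

    `positions` could stand for word indices or 1-D region coordinates.
    """
    return [[-abs(p - q) for q in positions] for p in positions]


def structure_biased_attention(nodes, structure_bias):
    """One attention step where a structure prior is added to the
    semantic (dot-product) correlation matrix before the softmax.

    nodes: list of n feature vectors, each of length d.
    structure_bias: n x n matrix of additive priors (an assumption here).
    """
    n, d = len(nodes), len(nodes[0])
    updated = []
    for i in range(n):
        scores = [
            sum(a * b for a, b in zip(nodes[i], nodes[j])) / math.sqrt(d)
            + structure_bias[i][j]
            for j in range(n)
        ]
        weights = softmax(scores)
        # Each updated node is a weighted average of all node features.
        updated.append(
            [sum(weights[j] * nodes[j][k] for j in range(n)) for k in range(d)]
        )
    return updated


def scored_pooling(nodes, score_vector):
    """Score each node against a (hypothetical) scoring vector and return
    the score-weighted average of the node features."""
    scores = [sum(a * b for a, b in zip(node, score_vector)) for node in nodes]
    weights = softmax(scores)
    d = len(nodes[0])
    return [sum(w * node[k] for w, node in zip(weights, nodes)) for k in range(d)]


if __name__ == "__main__":
    nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    bias = distance_bias([0, 1, 2])
    hidden = structure_biased_attention(nodes, bias)
    print(scored_pooling(hidden, [1.0, 1.0]))
```

Because the prior enters as an additive term before the softmax, the same layer can be applied repeatedly with the prior supplied each time, which is one plausible reading of how structure information could keep guiding deeper stacks.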

Keywords:

Computer science; Commonsense reasoning; Embedding; Modal; Commonsense knowledge; Transformer; Graph; Artificial intelligence; Theoretical computer science; Knowledge-based systems

Metrics

FWCI (Field Weighted Citation Impact): 1.82

Citation Normalized Percentile: 0.82

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Video Analysis and Summarization

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Knowledge Induced Graph Transformer for Visual Commonsense Reasoning

Visual Commonsense Reasoning with Vision-Language Co-embedding and Knowledge Graph Embedding

Multi-Source Knowledge Reasoning Graph Network for Multi-Modal Commonsense Inference

Temporal-based graph reasoning for Visual Commonsense Reasoning

Heterogeneous Graph Learning for Visual Commonsense Reasoning