JOURNAL ARTICLE

Multi-Modal Structure-Embedding Graph Transformer for Visual Commonsense Reasoning

Jian Zhu, Hanli Wang, Bin He

Year: 2023; Journal: IEEE Transactions on Multimedia; Vol: 26; Pages: 1295-1305; Publisher: Institute of Electrical and Electronics Engineers

Abstract

Visual commonsense reasoning (VCR) is a challenging reasoning task that aims not only to answer a question about a given image but also to provide a rationale justifying the choice. Graph-based networks are well suited to representing and extracting the correlation between image and language for reasoning, and how to construct and learn graphs from such multi-modal Euclidean data is a fundamental problem. Most existing graph-based methods treat visual regions and linguistic words as identical graph nodes, ignoring the inherent characteristics of multi-modal data. In addition, these approaches typically have only one graph-learning layer, and their performance declines as the model goes deeper. To address these issues, a novel method named Multi-modal Structure-embedding Graph Transformer (MSGT) is proposed. Specifically, an answer-vision graph and an answer-question graph are constructed to represent and model intra-modal and inter-modal correlations in VCR simultaneously, where additional multi-modal structure representations are initialized and embedded according to visual region distances and linguistic word orders for a more reasonable graph representation. Then, a structure-injecting graph transformer is designed to inject the embedded structure priors into the semantic correlation matrix for the evolution of node features and structure representations; this allows more layers to be stacked, making the model deeper and extracting more powerful features with instructive priors. To adaptively fuse graph features, a scored pooling mechanism is further developed to select valuable clues for reasoning from the learnt node features. Experiments demonstrate the superiority of the proposed MSGT framework compared with state-of-the-art methods on the VCR benchmark dataset.
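
The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the formulas, so the sketch assumes that the structure prior is injected as an additive bias on scaled dot-product scores and that scored pooling is a softmax-weighted sum of node features. All function names, the `score_vector` parameter, and the toy values are hypothetical, and learned projections are omitted for brevity.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def structure_biased_attention(nodes, structure_bias):
    """One attention step with an additive structure prior (assumed form).

    nodes: list of n feature vectors, each a list of d floats.
    structure_bias: n x n matrix of floats; a hypothetical prior that could
        be derived from region distances or word-order offsets.
    """
    d = len(nodes[0])
    updated = []
    for i, query in enumerate(nodes):
        # Semantic correlation (scaled dot product) plus the structure prior.
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
            + structure_bias[i][j]
            for j, key in enumerate(nodes)
        ]
        weights = softmax(scores)
        # New feature for node i: attention-weighted mix of all node features.
        updated.append(
            [sum(w * v[c] for w, v in zip(weights, nodes)) for c in range(d)]
        )
    return updated


def scored_pooling(nodes, score_vector):
    """Pool node features with softmax weights from a scalar score per node.

    score_vector stands in for a learned scoring parameter (assumption).
    """
    d = len(nodes[0])
    scores = [sum(a * b for a, b in zip(n, score_vector)) for n in nodes]
    weights = softmax(scores)
    return [sum(w * n[c] for w, n in zip(weights, nodes)) for c in range(d)]


# Toy usage: three nodes, with a bias that penalizes structurally distant pairs.
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [[0.0, -1.0, -2.0], [-1.0, 0.0, -1.0], [-2.0, -1.0, 0.0]]
hidden = structure_biased_attention(nodes, bias)
pooled = scored_pooling(hidden, [0.5, 0.5])
```

Because the bias is added before the softmax, a large negative entry effectively removes an edge, while a zero bias reduces the step to plain dot-product attention. That is one simple way a structural prior can steer the semantic correlation matrix.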

Keywords:
Computer science; Commonsense reasoning; Embedding; Modal; Commonsense knowledge; Transformer; Graph; Artificial intelligence; Theoretical computer science; Knowledge-based systems

Metrics

Cited By: 10
FWCI (Field Weighted Citation Impact): 1.82
Refs: 46
Citation Normalized Percentile: 0.82

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Video Analysis and Summarization
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Visual Commonsense Reasoning with Vision-Language Co-embedding and Knowledge Graph Embedding

Jaeyun Lee, Incheol Kim

Journal: Journal of KIISE; Year: 2020; Vol: 47 (10); Pages: 985-998

JOURNAL ARTICLE

Multi-Source Knowledge Reasoning Graph Network for Multi-Modal Commonsense Inference

Xuan Ma, Xiaoshan Yang, Changsheng Xu

Journal: ACM Transactions on Multimedia Computing Communications and Applications; Year: 2022; Vol: 19 (4); Pages: 1-17

JOURNAL ARTICLE

Temporal-based graph reasoning for Visual Commonsense Reasoning

Shaojuan Wu, Kexin Liu, Jitong Li, Peng Chen, Xiaowang Zhang, Zhiyong Feng

Journal: Knowledge-Based Systems; Year: 2025; Vol: 315; Pages: 113214-113214

JOURNAL ARTICLE

Heterogeneous Graph Learning for Visual Commonsense Reasoning

Weijiang Yu, Jingwen Zhou, Weihao Yu, Xiaodan Liang, Nong Xiao

Journal: arXiv (Cornell University); Year: 2019; Vol: 32; Pages: 2765-2775