JOURNAL ARTICLE

Multi-Modal Dynamic Graph Transformer for Visual Grounding

Sijia Chen, Baochun Li

Year: 2022 Journal: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Pages: 15513-15522

Abstract

Visual grounding (VG) aims to align the correct regions of an image with a natural language query about that image. We found that existing VG methods are trapped by a single-stage grounding process that performs one evaluate-and-rank pass over meticulously prepared regions. Their performance depends on the density and quality of the candidate regions, and is capped by the inability to optimize the located regions continuously. To address these issues, we propose to remodel VG as a progressively optimized visual semantic alignment process. Our proposed multi-modal dynamic graph transformer (M-DGT) achieves this by building upon a dynamic graph structure with regions as nodes and their semantic relations as edges. Starting from a few randomly initialized regions, M-DGT makes continual adjustments (i.e., 2D spatial transformation and deletion) to the nodes and edges of the graph based on multi-modal information and the graph feature, thereby efficiently shrinking the graph to approach the ground truth regions. Experiments show that with an average of 48 boxes as initialization, M-DGT outperforms existing state-of-the-art methods on the Flickr30k Entities and RefCOCO datasets by a substantial margin, in terms of both accuracy and Intersection over Union (IoU) scores. Furthermore, introducing M-DGT to optimize the predicted regions of existing methods significantly improves their performance. The source code is available at https://github.com/iQua/M-DGT.
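
The progressive process described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `iou`, `refine`, `transform_fn`, `score_fn`, and `keep_ratio` are hypothetical, graph edges are omitted, and the learned, multi-modal update is replaced by a stand-in that nudges each box toward a known target. It only shows the shape of the loop: transform every node, then delete low-scoring nodes so the set of regions shrinks.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x1 < x2 and y1 < y2


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def refine(
    boxes: List[Box],
    transform_fn: Callable[[Box], Box],
    score_fn: Callable[[Box], float],
    steps: int,
    keep_ratio: float = 0.5,
) -> List[Box]:
    """Toy loop: transform all nodes, then delete the lowest-scoring ones.

    transform_fn and score_fn are placeholders (assumptions) for whatever a
    real model would predict from image and query features.
    """
    nodes = list(boxes)
    for _ in range(steps):
        nodes = [transform_fn(b) for b in nodes]       # 2D spatial transformation
        nodes.sort(key=score_fn, reverse=True)         # rank the current nodes
        keep = max(1, int(len(nodes) * keep_ratio))    # always keep at least one
        nodes = nodes[:keep]                           # deletion shrinks the graph
    return nodes


# Usage with oracle stand-ins (for illustration only; a real system has no
# access to the target box at inference time).
target: Box = (10.0, 10.0, 50.0, 50.0)


def step_toward_target(b: Box) -> Box:
    # Move each coordinate halfway toward the target box.
    return tuple(c + 0.5 * (t - c) for c, t in zip(b, target))  # type: ignore[return-value]


initial = [
    (0.0, 0.0, 100.0, 100.0),
    (5.0, 5.0, 30.0, 30.0),
    (40.0, 40.0, 90.0, 90.0),
    (60.0, 0.0, 100.0, 40.0),
]
final = refine(initial, step_toward_target, lambda b: iou(b, target), steps=6)
print(len(final), round(iou(final[0], target), 3))
```

Passing the transformation and the scoring function as arguments keeps the loop independent of any particular model, which is the only design point this sketch is meant to convey.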

Keywords:
Computer science, Initialization, Transformer, Ground, Modal, Graph, Ground truth, Artificial intelligence, Data mining, Pattern recognition (psychology), Theoretical computer science, Voltage

Metrics

Cited By: 23
FWCI (Field Weighted Citation Impact): 1.59
Refs: 58
Citation Normalized Percentile: 0.88

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence

Related Documents

JOURNAL ARTICLE

Multi-Modal Structure-Embedding Graph Transformer for Visual Commonsense Reasoning

Jian Zhu, Hanli Wang, Bin He

Journal: IEEE Transactions on Multimedia Year: 2023 Vol: 26 Pages: 1295-1305
JOURNAL ARTICLE

Learning Cross-Modal Context Graph for Visual Grounding

Yongfei Liu, Bo Wan, Xiaodan Zhu, Xuming He

Journal: Proceedings of the AAAI Conference on Artificial Intelligence Year: 2020 Vol: 34 (07) Pages: 11645-11652
JOURNAL ARTICLE

Multi-View Transformer for 3D Visual Grounding

Shijia Huang, Yilun Chen, Jiaya Jia, Liwei Wang

Journal: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Year: 2022 Pages: 15503-15512