Multi-Modal Dynamic Graph Transformer for Visual Grounding

Sijia Chen; Baochun Li

doi:10.1109/cvpr52688.2022.01509

Year: 2022
Venue: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Pages: 15513-15522

Abstract

Visual grounding (VG) aims to align the correct regions of an image with a natural language query about that image. We found that existing VG methods are trapped by a single-stage grounding process that performs one evaluate-and-rank pass over meticulously prepared regions. Their performance depends on the density and quality of the candidate regions, and is capped by the inability to optimize the located regions continuously. To address these issues, we propose to remodel VG as a progressively optimized visual semantic alignment process. Our proposed multi-modal dynamic graph transformer (M-DGT) achieves this by building upon a dynamic graph structure with regions as nodes and their semantic relations as edges. Starting from a few randomly initialized regions, M-DGT makes sustained adjustments (i.e., 2D spatial transformation and deletion) to the nodes and edges of the graph based on multi-modal information and the graph feature, thereby efficiently shrinking the graph to approach the ground truth regions. Experiments show that, with an average of 48 boxes as initialization, M-DGT outperforms existing state-of-the-art methods on the Flickr30k Entities and RefCOCO datasets by a substantial margin, in terms of both accuracy and Intersection over Union (IoU) scores. Furthermore, introducing M-DGT to optimize the predicted regions of existing methods can significantly improve their performance. The source code is available at https://github.com/iQua/M-DGT.
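The abstract reports results in terms of accuracy and IoU. As a minimal sketch of how these two quantities are commonly computed for box-level grounding (not the authors' evaluation code), the Python below assumes boxes in (x1, y1, x2, y2) corner format and counts a prediction as correct when its IoU with the ground truth is at least 0.5. The corner format, the 0.5 threshold, and the names `iou` and `grounding_accuracy` are illustrative assumptions; none of them are stated in the abstract.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box],
    targets: Sequence[Box],
    threshold: float = 0.5,  # assumed threshold, not taken from the abstract
) -> float:
    """Fraction of predictions whose IoU with the paired target meets the threshold."""
    if not preds:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)]
    ground_truth = [(0.0, 0.0, 2.0, 2.0), (5.0, 5.0, 6.0, 6.0)]
    print(iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)))
    print(grounding_accuracy(predicted, ground_truth))
```

Under this convention, a method that refines its boxes over several steps is scored only on the final boxes, so any refinement that raises IoU past the threshold converts a miss into a hit.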

Keywords:

Computer science; Initialization; Transformer; Ground; Modal; Graph; Ground truth; Artificial intelligence; Data mining; Pattern recognition (psychology); Theoretical computer science; Voltage

Metrics

FWCI (Field Weighted Citation Impact): 1.59

Citation Normalized Percentile: 0.88

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Dynamic Multi-modal Prompting for Efficient Visual Grounding

Multi-Modal Structure-Embedding Graph Transformer for Visual Commonsense Reasoning

Attribute-Prompting Multi-Modal Object Reasoning Transformer for Remote Sensing Visual Grounding

Learning Cross-Modal Context Graph for Visual Grounding

Multi-View Transformer for 3D Visual Grounding