Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval

Sijin Wang; Ruiping Wang; Ziwei Yao; Shiguang Shan; Xilin Chen

doi:10.1109/wacv45572.2020.9093614

JOURNAL ARTICLE

Year: 2020

Abstract

Image-text retrieval of natural scenes has been a popular research topic. Since image and text are heterogeneous cross-modal data, one of the key challenges is learning comprehensive yet unified representations of the multi-modal data. A natural scene image mainly involves two kinds of visual concepts, objects and their relationships, which are equally essential to image-text retrieval. Therefore, a good representation should account for both of them. In light of the recent success of scene graphs in describing complex natural scenes in many CV and NLP tasks, we propose to represent image and text with two kinds of scene graphs: a visual scene graph (VSG) and a textual scene graph (TSG), each of which jointly characterizes the objects and relationships in the corresponding modality. The image-text retrieval task is then naturally formulated as cross-modal scene graph matching. Specifically, we design two scene graph encoders, one for the VSG and one for the TSG, which refine the representation of each node in the graph by aggregating neighborhood information. As a result, both object-level and relationship-level cross-modal features can be obtained, which enables us to evaluate the similarity of image and text at the two levels in a more plausible way. We achieve state-of-the-art results on Flickr30k and MS COCO, which verifies the advantages of our graph-matching-based approach for image-text retrieval.
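The two ideas in the abstract, refining each node by aggregating its neighborhood and scoring similarity separately at the object level and the relationship level, can be illustrated with a small sketch. This is not the paper's model: the learned encoders are replaced here by a single round of plain mean aggregation, the scoring rule (best-match cosine similarity, averaged per level and then summed) is an assumption, and the one-hot vectors and the names `aggregate`, `level_similarity`, `graph_similarity`, and `make_graph` are hypothetical.

```python
import math

def aggregate(features, edges):
    """Refine each node by averaging it with its graph neighbors (one round)."""
    neighbors = {node: [node] for node in features}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return {
        node: [sum(col) / len(group) for col in zip(*(features[n] for n in group))]
        for node, group in neighbors.items()
    }

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def level_similarity(text_nodes, image_nodes):
    """Average, over text nodes, of the best cosine match among image nodes."""
    if not text_nodes or not image_nodes:
        return 0.0
    best = [max(cosine(t, v) for v in image_nodes) for t in text_nodes]
    return sum(best) / len(best)

def refined_by_kind(graph, kind):
    """Aggregate the graph, then keep only nodes of the requested kind."""
    refined = aggregate(graph["features"], graph["edges"])
    return [vec for name, vec in refined.items() if graph["kinds"][name] == kind]

def graph_similarity(image_graph, text_graph):
    """Sum of object-level and relationship-level similarity (assumed rule)."""
    return sum(
        level_similarity(refined_by_kind(text_graph, kind),
                         refined_by_kind(image_graph, kind))
        for kind in ("object", "relation")
    )

# Hypothetical one-hot embeddings standing in for learned features.
MAN, HORSE, RIDE, FEED = [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]

def make_graph(relation, relation_vector):
    """Toy scene graph for the triple man -> relation -> horse."""
    return {
        "features": {"man": MAN, "horse": HORSE, relation: relation_vector},
        "kinds": {"man": "object", "horse": "object", relation: "relation"},
        "edges": [("man", relation), (relation, "horse")],
    }

image_graph = make_graph("ride", RIDE)        # image: a man rides a horse
matching_text = make_graph("ride", RIDE)      # caption: "a man rides a horse"
mismatched_text = make_graph("feed", FEED)    # caption: "a man feeds a horse"

score_match = graph_similarity(image_graph, matching_text)
score_mismatch = graph_similarity(image_graph, mismatched_text)
print(score_match, score_mismatch)
```

In this toy setting both captions mention the same objects, yet the caption with the wrong relationship scores lower at both levels, because aggregation mixes the relationship node into its neighboring object nodes.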

Keywords:

Computer science; Scene graph; Graph; Artificial intelligence; Modal; Image retrieval; Visual Word; Matching (statistics); Information retrieval; Computer vision; Representation (politics); Image (mathematics); Pattern recognition (psychology); Theoretical computer science; Mathematics

Metrics

Cited By: 236

FWCI (Field Weighted Citation Impact): 15.64

Refs

Citation Normalized Percentile: 0.99

Is in top 1%

Is in top 10%

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Topic Modeling

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Cross-modal Graph Matching Network for Image-text Retrieval

Scene-text aware cross-modal retrieval based on semantic matching (ChinaMM2024)

Cross-modal multi-relationship aware reasoning for image-text matching

Text-Image Matching for Cross-Modal Remote Sensing Image Retrieval via Graph Neural Network

Cross-modal independent matching network for image-text retrieval