JOURNAL ARTICLE

Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval

Abstract

Image-text retrieval of natural scenes has been a popular research topic. Since image and text are heterogeneous cross-modal data, one of the key challenges is how to learn comprehensive yet unified representations of the multi-modal data. A natural scene image mainly involves two kinds of visual concepts, objects and their relationships, which are equally essential to image-text retrieval. A good representation should therefore account for both. In light of the recent success of scene graphs in describing complex natural scenes in many CV and NLP tasks, we propose to represent image and text with two kinds of scene graphs: a visual scene graph (VSG) and a textual scene graph (TSG), each of which jointly characterizes the objects and relationships in the corresponding modality. The image-text retrieval task is then naturally formulated as cross-modal scene graph matching. Specifically, we design two scene graph encoders, one for the VSG and one for the TSG, each of which refines the representation of every node on the graph by aggregating neighborhood information. As a result, both object-level and relationship-level cross-modal features can be obtained, which lets us evaluate the similarity of image and text at the two levels in a more plausible way. We achieve state-of-the-art results on Flickr30k and MS COCO, which verifies the advantages of our graph matching based approach for image-text retrieval.
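
The two ideas in the abstract, refining each node from its neighborhood and then scoring similarity separately at the object level and the relationship level, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy three-dimensional features, mean aggregation as the refinement step, and best-match cosine similarity as the score are stand-ins, not the encoders or the similarity function of the paper.

```python
import math


def aggregate(features, edges):
    """Refine each node by averaging it with its neighbors (one round).

    Mean aggregation is an illustrative stand-in for a learned graph encoder.
    """
    neighbors = {node: [node] for node in features}  # each node includes itself
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    refined = {}
    for node, group in neighbors.items():
        dim = len(features[node])
        refined[node] = [
            sum(features[n][i] for n in group) / len(group) for i in range(dim)
        ]
    return refined


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def level_similarity(text_nodes, image_nodes):
    """Average, over text nodes, of the best cosine match among image nodes."""
    if not text_nodes or not image_nodes:
        return 0.0
    best = [max(cosine(t, v) for v in image_nodes) for t in text_nodes]
    return sum(best) / len(best)


def nodes_of_kind(graph, refined, kind):
    """Select the refined vectors of all nodes with the given kind."""
    return [refined[n] for n, k in graph["kinds"].items() if k == kind]


def graph_similarity(text_graph, image_graph):
    """Sum of object-level and relationship-level similarity."""
    t_ref = aggregate(text_graph["features"], text_graph["edges"])
    i_ref = aggregate(image_graph["features"], image_graph["edges"])
    return sum(
        level_similarity(
            nodes_of_kind(text_graph, t_ref, kind),
            nodes_of_kind(image_graph, i_ref, kind),
        )
        for kind in ("object", "relationship")
    )


# Toy scene graphs (hypothetical features): "man riding horse" as a caption
# graph, and "dog chasing ball" as the graph of an unrelated image.
caption = {
    "features": {"man": [1.0, 0.0, 0.0], "riding": [0.0, 1.0, 0.0], "horse": [0.0, 0.0, 1.0]},
    "edges": [("man", "riding"), ("riding", "horse")],
    "kinds": {"man": "object", "riding": "relationship", "horse": "object"},
}
image = {
    "features": {"dog": [0.0, 1.0, 0.0], "chasing": [1.0, 0.0, 0.0], "ball": [0.0, 0.0, 1.0]},
    "edges": [("dog", "chasing"), ("chasing", "ball")],
    "kinds": {"dog": "object", "chasing": "relationship", "ball": "object"},
}

print(graph_similarity(caption, caption), graph_similarity(caption, image))
```

In this sketch a graph matched against itself reaches the maximum score (one per level), while a mismatched pair scores lower. A real system would learn the node features and the aggregation weights rather than fix them by hand.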

Keywords:
Computer science, Scene graph, Graph, Artificial intelligence, Modal, Image retrieval, Visual Word, Matching (statistics), Information retrieval, Computer vision, Representation (politics), Image (mathematics), Pattern recognition (psychology), Theoretical computer science, Mathematics

Metrics

Cited By: 236
FWCI (Field Weighted Citation Impact): 15.64
Refs: 52
Citation Normalized Percentile: 0.99
Is in top 1%
Is in top 10%

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Topic Modeling
Physical Sciences →  Computer Science →  Artificial Intelligence

Related Documents

JOURNAL ARTICLE

Cross-modal Graph Matching Network for Image-text Retrieval

Yuhao Cheng, Xiaoguang Zhu, Jiuchao Qian, Fei Wen, Peilin Liu

Journal: ACM Transactions on Multimedia Computing Communications and Applications Year: 2022 Vol: 18 (4) Pages: 1-23
JOURNAL ARTICLE

Cross-modal multi-relationship aware reasoning for image-text matching

Jin Zhang, Xiaohai He, Linbo Qing, Luping Liu, Xiaodong Luo

Journal: Multimedia Tools and Applications Year: 2021 Vol: 81 (9) Pages: 12005-12027
JOURNAL ARTICLE

Text-Image Matching for Cross-Modal Remote Sensing Image Retrieval via Graph Neural Network

Hongfeng Yu, Fanglong Yao, Wanxuan Lu, Nayu Liu, Peiguang Li, Hongjian You, Xian Sun

Journal: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing Year: 2022 Vol: 16 Pages: 812-824
JOURNAL ARTICLE

Cross-modal independent matching network for image-text retrieval

Ke Xiao, Baitao Chen, Xiong Yang, Yuhang Cai, Hao Líu, Wenzhong Guo

Journal: Pattern Recognition Year: 2024 Vol: 159 Pages: 111096-111096