JOURNAL ARTICLE

Image–Text Cross-Modal Retrieval with Instance Contrastive Embedding

Ruigeng Zeng, Wentao Ma, Xiaoqian Wu, Wei Liu, Jie Liu

Year: 2024
Journal: Electronics
Vol: 13 (2)
Pages: 300-300
Publisher: Multidisciplinary Digital Publishing Institute

Abstract

Image–text cross-modal retrieval aims to bridge the semantic gap between different modalities, allowing for the search of images based on textual descriptions or vice versa. Existing efforts in this field concentrate on coarse-grained feature representation and then utilize pairwise ranking loss to pull image–text positive pairs closer, pushing negative ones apart. However, using pairwise ranking loss directly on coarse-grained representation lacks reliability as it disregards fine-grained information, posing a challenge in narrowing the semantic gap between image and text. To this end, we propose an Instance Contrastive Embedding (IConE) method for image–text cross-modal retrieval. Specifically, we first transfer the multi-modal pre-training model to the cross-modal retrieval task to leverage the interactive information between image and text, thereby enhancing the model’s representational capabilities. Then, to comprehensively consider the feature distribution of intra- and inter-modality, we design a novel two-stage training strategy that combines instance loss and contrastive loss, dedicated to extracting fine-grained representation within instances and bridging the semantic gap between modalities. Extensive experiments on two public benchmark datasets, Flickr30k and MS-COCO, demonstrate that our IConE outperforms several state-of-the-art (SoTA) baseline methods and achieves competitive performance.
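The abstract names two loss terms, an instance loss and a contrastive loss, without giving their exact form. The following is a minimal sketch of one common formulation of each, not the authors' implementation: the contrastive term is assumed to be a symmetric InfoNCE loss over a batch of matched image and text embeddings, and the instance term is assumed to be a cross-entropy classification loss in which every image–text pair is treated as its own class. The function names, the `temperature` value, and the toy vectors are all hypothetical. How the two terms are scheduled across the two training stages is not stated in the abstract and is not modeled here.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (zero vectors are left unchanged).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of scores.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Assumed form: symmetric InfoNCE. Row k of img_emb matches row k of txt_emb;
    # every other row in the batch acts as a negative.
    imgs = [l2_normalize(v) for v in img_emb]
    txts = [l2_normalize(v) for v in txt_emb]
    n = len(imgs)
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    i2t = -sum(log_softmax(sim[k])[k] for k in range(n)) / n
    t2i = -sum(log_softmax(sim_t[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

def instance_loss(logits, instance_ids):
    # Assumed form: cross-entropy where each image-text instance is a class.
    # logits[k] holds the classifier scores for sample k.
    n = len(logits)
    return -sum(log_softmax(logits[k])[instance_ids[k]] for k in range(n)) / n

# Toy batch: two images and two captions in a 2-D embedding space.
img = [[1.0, 0.0], [0.0, 1.0]]
txt_aligned = [[0.9, 0.1], [0.1, 0.9]]   # captions close to their own image
txt_swapped = [[0.1, 0.9], [0.9, 0.1]]   # captions close to the wrong image

loss_aligned = contrastive_loss(img, txt_aligned)
loss_swapped = contrastive_loss(img, txt_swapped)
loss_instance = instance_loss([[2.0, 0.0], [0.0, 2.0]], [0, 1])
```

In this toy batch the aligned captions yield a much lower contrastive loss than the swapped ones, which is the "pull positives closer, push negatives apart" behavior the abstract describes. The instance term depends only on per-sample classification scores, so it can be applied to each modality separately.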

Keywords:
Computer science; Semantic gap; Pairwise comparison; Artificial intelligence; Embedding; Leverage (statistics); Modal; Feature (linguistics); Representation (politics); Image retrieval; Ranking (information retrieval); Benchmark (surveying); Bridging (networking); Image (mathematics); Feature learning; Pattern recognition (psychology); Information retrieval; Discriminative model; Natural language processing

Metrics

Cited By: 6
FWCI (Field Weighted Citation Impact): 3.18
Refs: 44
Citation Normalized Percentile: 0.85

Topics

Multimodal Machine Learning Applications
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences → Computer Science → Artificial Intelligence