Image–Text Cross-Modal Retrieval with Instance Contrastive Embedding

Ruigeng Zeng; Wentao Ma; Xiaoqian Wu; Wei Liu; Jie Liu

doi:10.3390/electronics13020300

Year: 2024
Journal: Electronics
Vol: 13 (2)
Pages: 300
Publisher: Multidisciplinary Digital Publishing Institute

Abstract

Image–text cross-modal retrieval aims to bridge the semantic gap between modalities, so that images can be retrieved from textual descriptions and vice versa. Existing work in this field concentrates on coarse-grained feature representations and applies a pairwise ranking loss to pull positive image–text pairs closer together and push negative pairs apart. However, applying a pairwise ranking loss directly to coarse-grained representations is unreliable because it disregards fine-grained information, which makes it hard to narrow the semantic gap between image and text. To this end, we propose an Instance Contrastive Embedding (IConE) method for image–text cross-modal retrieval. Specifically, we first transfer a multi-modal pre-training model to the cross-modal retrieval task to leverage the interactive information between image and text, thereby enhancing the model’s representational capabilities. Then, to account for both the intra-modality and inter-modality feature distributions, we design a novel two-stage training strategy that combines an instance loss and a contrastive loss, aimed at extracting fine-grained representations within instances and bridging the semantic gap between modalities. Extensive experiments on two public benchmark datasets, Flickr30k and MS-COCO, demonstrate that IConE outperforms several state-of-the-art (SoTA) baseline methods and achieves competitive performance.
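The abstract names three loss functions without giving their formulas. The sketch below shows the commonly used textbook form of each, purely as an illustration: a hinge-based pairwise ranking loss, a symmetric softmax (InfoNCE-style) contrastive loss, and an instance loss that treats every image–text pair as its own class. The function names, the margin of 0.2, the temperature of 0.07, and the assumption that matched pairs sit on the diagonal of the similarity matrix are all illustrative choices, not details taken from the paper, and the exact formulation used by IConE may differ.

```python
import math


def cosine_similarity_matrix(image_embs, text_embs):
    """Return S where S[i][j] is the cosine similarity of image i and text j.

    Assumes every embedding is a non-zero vector of the same length.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def log_softmax(row):
    """Numerically stable log-softmax of a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def pairwise_ranking_loss(sim, margin=0.2):
    """Hinge ranking loss over all in-batch negatives, in both directions.

    Illustrative form only; the margin value is an assumption.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Image i should score higher with its own text than with text j.
            total += max(0.0, margin - sim[i][i] + sim[i][j])
            # Text i should score higher with its own image than with image j.
            total += max(0.0, margin - sim[i][i] + sim[j][i])
    return total / n


def contrastive_loss(sim, temperature=0.07):
    """Symmetric softmax contrastive loss (image-to-text plus text-to-image).

    Illustrative form only; the temperature value is an assumption.
    """
    n = len(sim)
    i2t = [log_softmax([s / temperature for s in row]) for row in sim]
    t2i = [log_softmax([sim[i][j] / temperature for i in range(n)])
           for j in range(n)]
    return -sum(i2t[k][k] + t2i[k][k] for k in range(n)) / (2 * n)


def instance_loss(logits, instance_ids):
    """Cross-entropy in which each image-text pair (instance) is one class.

    `logits[k]` holds hypothetical classifier scores for sample k, and
    `instance_ids[k]` is the index of the instance that sample belongs to.
    """
    losses = [-log_softmax(row)[c] for row, c in zip(logits, instance_ids)]
    return sum(losses) / len(losses)
```

With these definitions, a batch whose matched pairs are the most similar yields a lower ranking and contrastive loss than a batch whose pairs are shuffled, which is the behaviour the abstract describes as pulling positives together and pushing negatives apart. The instance loss, by contrast, operates within a single modality's classifier scores, which is one way to read the abstract's distinction between intra-modality and inter-modality objectives.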

Keywords:

Computer science; Semantic gap; Pairwise comparison; Artificial intelligence; Embedding; Leverage (statistics); Modal; Feature (linguistics); Representation (politics); Image retrieval; Ranking (information retrieval); Benchmark (surveying); Bridging (networking); Image (mathematics); Feature learning; Pattern recognition (psychology); Information retrieval; Discriminative model; Natural language processing

Metrics

FWCI (Field Weighted Citation Impact): 3.18

Citation Normalized Percentile: 0.85

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Improving text-image cross-modal retrieval with contrastive loss

Image-Text Embedding with Hierarchical Knowledge for Cross-Modal Retrieval

Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval

Super Visual Semantic Embedding for Cross-Modal Image-Text Retrieval

Iterative Uni-modal and Cross-modal Clustered Contrastive Learning for Image-text Retrieval