JOURNAL ARTICLE

Iterative Uni-modal and Cross-modal Clustered Contrastive Learning for Image-text Retrieval

Abstract

Multimedia data has grown rapidly in both volume and variety, and cross-modal retrieval has become an active research topic in recent years. We address image-to-text and text-to-image retrieval by proposing a symmetric two-stream pre-training framework. The architecture is based on the CLIP model and consists of a BERT-pretrained text encoder and a Vision Transformer (ViT)-pretrained image encoder. We train the model in an unsupervised manner using not only a cross-modal contrastive loss but also two symmetric uni-modal contrastive losses. In addition, we propose novel training strategies, including a multi-stage training scheme and an iterative training strategy with clustered hard negative data. Experimental results show that adding the uni-modal self-supervised branch and losses improves performance over the CLIP model alone.
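The abstract gives no formulas, so the following is only a minimal sketch of how a cross-modal contrastive loss could be combined with two symmetric uni-modal contrastive losses. It assumes an InfoNCE-style objective with in-batch negatives, uni-modal positives formed from a second (augmented) view of each sample, a temperature of 0.07, and an equal weighting `w_uni`; the function names and all of these choices are illustrative assumptions, not the authors' published implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss (assumed form, not taken from the paper).

    Row i of `a` and row i of `b` form the positive pair; every other
    row in the batch serves as an in-batch negative.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature  # (batch, batch) similarity matrix

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row, then pick the diagonal.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the a-to-b and b-to-a directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


def total_loss(img, img_aug, txt, txt_aug, w_uni=1.0):
    """Cross-modal loss plus two uni-modal losses (hypothetical weighting)."""
    cross = info_nce(img, txt)        # image <-> text
    uni_img = info_nce(img, img_aug)  # image <-> augmented image
    uni_txt = info_nce(txt, txt_aug)  # text <-> augmented text
    return cross + w_uni * (uni_img + uni_txt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(8, 16))  # stand-in for image-encoder outputs
    txt = rng.normal(size=(8, 16))  # stand-in for text-encoder outputs
    img_aug = img + 0.1 * rng.normal(size=img.shape)
    txt_aug = txt + 0.1 * rng.normal(size=txt.shape)
    print(total_loss(img, img_aug, txt, txt_aug))
```

In this sketch the random arrays merely stand in for encoder outputs; in a CLIP-style setup they would come from the image and text towers, and the hard negatives mentioned in the abstract would change which samples share a batch rather than the form of the loss.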

Keywords:
Computer science; Modal; Encoder; Transformer; Artificial intelligence; Image retrieval; Scheme (mathematics); Image (mathematics); Contrast (vision); Pattern recognition (psychology); Speech recognition; Computer vision; Machine learning; Voltage; Engineering; Mathematics

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 24
Citation Normalized Percentile: 0.05

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence

Related Documents

JOURNAL ARTICLE

Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval

Haoyu Lu, Yuqi Huo, Mingyu Ding, Nanyi Fei, Zhiwu Lu

Journal: Machine Intelligence Research Year: 2023 Vol: 20 (4) Pages: 569-582
JOURNAL ARTICLE

Image–Text Cross-Modal Retrieval with Instance Contrastive Embedding

Ruigeng Zeng, Wentao Ma, Xiaoqian Wu, Wei Liu, Jie Liu

Journal: Electronics Year: 2024 Vol: 13 (2) Pages: 300-300
JOURNAL ARTICLE

Improving text-image cross-modal retrieval with contrastive loss

Chumeng Zhang, Yue Yang, Junbo Guo, Guoqing Jin, Dan Song, An An Liu

Journal: Multimedia Systems Year: 2022 Vol: 29 (2) Pages: 569-575
JOURNAL ARTICLE

Contrastive Learning‐Based Fine‐Tuning Method for Cross‐Modal Text‐Image Retrieval

Wei Zhao, Xuan Ma, Weigang Wang

Journal: Concurrency and Computation Practice and Experience Year: 2025 Vol: 37 (21-22)
JOURNAL ARTICLE

Iterative Matching with Text Generation for Cross-Modal Image-Text Retrieval

Yingying Pan, Qing Ma, Cong Bai

Journal: Journal of Computer-Aided Design & Computer Graphics Year: 2025 Vol: 37 (5) Pages: 856-864