Iterative Uni-modal and Cross-modal Clustered Contrastive Learning for Image-text Retrieval

Yi Zhu; Xiu Li

doi:10.1109/prmvia58252.2023.00009


JOURNAL ARTICLE


Year: 2023 Vol: 34 Pages: 15-23


Abstract

Multimedia data has grown explosively in both volume and variety, and cross-modal retrieval has consequently become an active research topic in recent years. We address image-to-text and text-to-image retrieval by proposing a symmetric two-stream pre-training framework. The architecture follows the CLIP model and consists of a BERT-pretrained text encoder and a Vision Transformer (ViT)-pretrained image encoder. We train the model in an unsupervised manner using not only a cross-modal contrastive loss but also two symmetric uni-modal contrastive losses. In addition, we propose novel training strategies, including a multi-stage training scheme and an iterative training strategy with clustered hard negative data. Experimental results show that introducing the uni-modal self-supervised branch and losses improves performance over the CLIP model alone.
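The abstract names one cross-modal contrastive loss and two symmetric uni-modal contrastive losses but gives no formulas. The sketch below is only an illustration of how such a combined objective could look, assuming an InfoNCE-style loss over in-batch negatives. The function names, the augmented views (img_aug, txt_aug), the temperature of 0.07, and the weight w_uni are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss over a batch (assumed form, not from the paper).

    The i-th anchor should match the i-th positive; every other row of
    `positives` in the batch serves as a negative.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i with every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def symmetric_info_nce(x, y, temperature=0.07):
    """Average the loss in both directions (x to y and y to x)."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


def total_loss(img, txt, img_aug, txt_aug, w_uni=1.0, temperature=0.07):
    """Cross-modal term plus two uni-modal terms (hypothetical weighting)."""
    cross = symmetric_info_nce(img, txt, temperature)          # image vs. text
    uni_img = symmetric_info_nce(img, img_aug, temperature)    # image vs. augmented image
    uni_txt = symmetric_info_nce(txt, txt_aug, temperature)    # text vs. augmented text
    return cross + w_uni * (uni_img + uni_txt)
```

In this sketch, the embeddings are plain lists of floats standing in for encoder outputs. A real training loop would compute the same quantities on batched tensors from the image and text encoders, and the hard negatives described in the abstract would enter by choosing which samples share a batch.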

Keywords:

Computer science; Modal; Encoder; Transformer; Artificial intelligence; Image retrieval; Scheme (mathematics); Image (mathematics); Contrast (vision); Pattern recognition (psychology); Speech recognition; Computer vision; Machine learning; Voltage; Engineering; Mathematics

Metrics

FWCI (Field Weighted Citation Impact): 0.00

Citation Normalized Percentile: 0.05

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

