RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-Training

Chen-Wei Xie; Siyang Sun; Xiong Xiong; Yun Zheng; Deli Zhao; Jingren Zhou

doi:10.1109/cvpr52729.2023.01846

JOURNAL ARTICLE

Year: 2023 Pages: 19265-19274

Abstract

Contrastive Language-Image Pre-training (CLIP) is attracting increasing attention for its impressive zero-shot recognition performance on a range of downstream tasks. However, training CLIP is data-hungry: it requires a large number of image-text pairs to memorize diverse semantic concepts. In this paper, we propose a novel and efficient framework, Retrieval Augmented Contrastive Language-Image Pre-training (RA-CLIP), which augments embeddings through online retrieval. Specifically, we sample part of the image-text data as a hold-out reference set. Given an input image, relevant image-text pairs are retrieved from the reference set to enrich the representation of the input image. This process can be viewed as an open-book exam: with the reference set as a cheat sheet, the proposed method does not need to memorize all visual concepts in the training data. Instead, it explores how to recognize visual concepts by exploiting the correspondence between images and texts in the cheat sheet. We conduct comprehensive experiments to show how RA-CLIP works. Results on 10 image classification datasets and 2 object detection datasets show that RA-CLIP outperforms the vanilla CLIP baseline by a large margin on zero-shot image classification (+12.7%), linear-probe image classification (+6.9%), and zero-shot ROI classification (+2.8%).
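The retrieval step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how pairs are retrieved or how they are fused with the input, so the cosine-similarity top-k lookup, the softmax weighting, the `alpha` blend, and every function and parameter name below are assumptions chosen for illustration. Random vectors stand in for the outputs of real image and text encoders.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def retrieve_top_k(query, ref_image_emb, k):
    """Return indices and cosine similarities of the k nearest reference images."""
    sims = l2_normalize(ref_image_emb) @ l2_normalize(query)
    top = np.argsort(-sims)[:k]  # highest similarity first
    return top, sims[top]


def augment_embedding(query, ref_image_emb, ref_text_emb, k=4,
                      temperature=0.1, alpha=0.5):
    """Blend the query with the texts paired to its nearest reference images.

    The softmax weighting and the `alpha` blend are assumptions made for
    illustration; the abstract does not describe the fusion step.
    """
    top, sims = retrieve_top_k(query, ref_image_emb, k)
    weights = np.exp((sims - sims.max()) / temperature)
    weights /= weights.sum()
    # Weighted average of the text embeddings paired with the retrieved images.
    context = weights @ l2_normalize(ref_text_emb[top])
    return l2_normalize((1.0 - alpha) * l2_normalize(query) + alpha * context)


rng = np.random.default_rng(0)
ref_image_emb = rng.normal(size=(100, 64))  # stand-in for hold-out reference images
ref_text_emb = rng.normal(size=(100, 64))   # stand-in for their paired texts
query = ref_image_emb[3] + 0.01 * rng.normal(size=64)  # input close to reference 3

top, sims = retrieve_top_k(query, ref_image_emb, k=4)
augmented = augment_embedding(query, ref_image_emb, ref_text_emb)
print(top, augmented.shape)
```

The sketch keeps the reference set outside the model's parameters, which is the point of the open-book analogy: swapping in a different reference set changes what can be looked up without retraining anything in this toy setup.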

Keywords:

Computer science; Artificial intelligence; Margin (machine learning); Task (project management); Set (abstract data type); Image retrieval; Image (mathematics); Memorization; Pattern recognition (psychology); Contextual image classification; Natural language processing; Object (grammar); Training set; Standard test image; Computer vision; Machine learning; Image processing; Mathematics

Metrics

FWCI (Field Weighted Citation Impact): 4.19

Citation Normalized Percentile: 0.93

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
