JOURNAL ARTICLE

CLIP-Driven Fine-Grained Text-Image Person Re-Identification

Shuanglin Yan, Neng Dong, Liyan Zhang, Jinhui Tang

Year: 2023 Journal: IEEE Transactions on Image Processing Vol: 32 Pages: 6032-6046 Publisher: Institute of Electrical and Electronics Engineers

Abstract

Text-Image Person Re-identification (TIReID) aims to retrieve the image corresponding to a given text query from a pool of candidate images. Existing methods employ prior knowledge from single-modality pre-training to facilitate learning, but lack multi-modal correspondence information. Vision-Language Pre-training, such as CLIP (Contrastive Language-Image Pretraining), can address this limitation. However, CLIP falls short in capturing fine-grained information, and thus its powerful capacity is not fully leveraged in TIReID. Moreover, the popular explicit local matching paradigm for mining fine-grained information relies heavily on the quality of local parts and on cross-modal inter-part interaction/guidance, leading to intra-modal information distortion and ambiguity problems. Accordingly, in this paper, we propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID. To transfer the multi-modal knowledge effectively, we conduct fine-grained information excavation to mine modality-shared discriminative details for global alignment. Specifically, we propose a multi-level global feature learning (MGF) module that fully mines the discriminative local information within each modality, emphasizing identity-related discriminative clues through enhanced interaction between the global image (text) representation and informative local patches (words). MGF generates a set of enhanced global features for later inference. Furthermore, we design cross-grained feature refinement (CFR) and fine-grained correspondence discovery (FCD) modules to establish cross-modal correspondence at both coarse and fine granularities (image-word, sentence-patch, word-patch), ensuring the reliability of the informative local patches/words. CFR and FCD are removed during inference to optimize computational efficiency. Extensive experiments on multiple benchmarks demonstrate the superior performance of our method in TIReID.
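At inference time, the abstract states that only the enhanced global features are kept and retrieval reduces to global alignment between a text query embedding and gallery image embeddings. A minimal sketch of that retrieval step is below; the embedding dimension, variable names, and `rank_gallery` helper are illustrative assumptions, not the authors' implementation, and the cosine-similarity ranking stands in for whatever matching score the encoders would actually produce.

```python
import numpy as np

def rank_gallery(text_emb, image_embs):
    """Rank gallery images by cosine similarity to a text query embedding.

    Illustrative only: in a CFine-style system, text_emb and image_embs
    would come from the CLIP-initialized encoders after multi-level
    global feature learning, with CFR/FCD already removed at inference.
    """
    t = text_emb / np.linalg.norm(text_emb)                      # unit-norm query
    g = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = g @ t                                                 # cosine similarity per image
    return np.argsort(-sims)                                     # best match first

# Toy example with random stand-in "embeddings" (dim 512, as in CLIP ViT-B).
rng = np.random.default_rng(0)
gallery = rng.standard_normal((5, 512))
query = gallery[2] + 0.01 * rng.standard_normal(512)  # query near image 2
ranking = rank_gallery(query, gallery)
print(ranking[0])  # → 2
```

Because random 512-dimensional vectors are nearly orthogonal, the perturbed query reliably retrieves its source image; real performance, of course, depends entirely on the learned encoders.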

Keywords:
Computer science, Artificial intelligence, Pattern recognition, Natural language processing, Feature learning, Discriminative model, Modality, Inference, Person re-identification

Metrics

Cited By: 223
FWCI (Field Weighted Citation Impact): 40.58
Refs: 70
Citation Normalized Percentile: 1.00 (in top 1% and top 10%)


Topics

Video Surveillance and Tracking Methods
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Fine-Grained Person Re-identification

Jiahang Yin, Ancong Wu, Wei‐Shi Zheng

Journal: International Journal of Computer Vision Year: 2020 Vol: 128 (6) Pages: 1654-1672
JOURNAL ARTICLE

Cloth-Changing Person Re-Identification Method Based on CLIP Enhanced Fine-Grained Features

GENG Xia, WANG Yao

Journal: DOAJ (Directory of Open Access Journals) Year: 2025
JOURNAL ARTICLE

TF-CLIP: Learning Text-Free CLIP for Video-Based Person Re-identification

Chenyang Yu, Xuehu Liu, Yingquan Wang, Pingping Zhang, Huchuan Lu

Journal: Proceedings of the AAAI Conference on Artificial Intelligence Year: 2024 Vol: 38 (7) Pages: 6764-6772