JOURNAL ARTICLE

CLIP-Driven Fine-Grained Text-Image Person Re-Identification

Shuanglin Yan, Neng Dong, Liyan Zhang, Jinhui Tang

Year: 2023 Journal: IEEE Transactions on Image Processing Vol: 32 Pages: 6032-6046 Publisher: Institute of Electrical and Electronics Engineers

Abstract

Text-Image Person Re-identification (TIReID) aims to retrieve the image corresponding to a given text query from a pool of candidate images. Existing methods employ prior knowledge from single-modality pre-training to facilitate learning, but lack multi-modal correspondence information. Vision-Language Pre-training, such as CLIP (Contrastive Language-Image Pretraining), can address this limitation. However, CLIP falls short in capturing fine-grained information, and thus its powerful capacity is not fully leveraged in TIReID. Moreover, the popular explicit local matching paradigm for mining fine-grained information relies heavily on the quality of local parts and on cross-modal inter-part interaction/guidance, leading to intra-modal information distortion and ambiguity problems. Accordingly, in this paper, we propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID. To transfer the multi-modal knowledge effectively, we conduct fine-grained information excavation to mine modality-shared discriminative details for global alignment. Specifically, we propose a multi-level global feature learning (MGF) module that fully mines the discriminative local information within each modality, emphasizing identity-related discriminative clues through enhanced interaction between the global image (text) representation and informative local patches (words). MGF generates a set of enhanced global features for later inference. Furthermore, we design cross-grained feature refinement (CFR) and fine-grained correspondence discovery (FCD) modules to establish cross-modal correspondence at both coarse and fine granularities (image-word, sentence-patch, word-patch), ensuring the reliability of the informative local patches/words. CFR and FCD are removed during inference to optimize computational efficiency. Extensive experiments on multiple benchmarks demonstrate the superior performance of our method in TIReID.
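At inference time, the abstract states that only the enhanced global features are kept and retrieval reduces to global alignment between a text query embedding and gallery image embeddings. A minimal sketch of that retrieval step is below; the embedding dimension, variable names, and `rank_gallery` helper are illustrative assumptions, not the authors' implementation, and the cosine-similarity ranking stands in for whatever matching score the encoders would actually produce.

```python
import numpy as np

def rank_gallery(text_emb, image_embs):
    """Rank gallery images by cosine similarity to a text query embedding.

    Illustrative only: in a CFine-style system, text_emb and image_embs
    would come from the CLIP-initialized encoders after multi-level
    global feature learning, with CFR/FCD already removed at inference.
    """
    t = text_emb / np.linalg.norm(text_emb)                      # unit-norm query
    g = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = g @ t                                                 # cosine similarity per image
    return np.argsort(-sims)                                     # best match first

# Toy example with random stand-in "embeddings" (dim 512, as in CLIP ViT-B).
rng = np.random.default_rng(0)
gallery = rng.standard_normal((5, 512))
query = gallery[2] + 0.01 * rng.standard_normal(512)  # query near image 2
ranking = rank_gallery(query, gallery)
print(ranking[0])  # → 2
```

Because random 512-dimensional vectors are nearly orthogonal, the perturbed query reliably retrieves its source image; real performance, of course, depends entirely on the learned encoders.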

Keywords:
Computer science, Artificial intelligence, Pattern recognition, Natural language processing, Feature learning, Discriminative model, Modality, Inference, Person re-identification

Metrics

Cited By: 223
FWCI (Field Weighted Citation Impact): 40.58
Refs: 70
Citation Normalized Percentile: 1.00 (in top 1% and top 10%)


Topics

Video Surveillance and Tracking Methods
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Fine-Grained Person Re-identification

Jiahang Yin, Ancong Wu, Wei‐Shi Zheng

Journal: International Journal of Computer Vision Year: 2020 Vol: 128 (6) Pages: 1654-1672
JOURNAL ARTICLE

Cloth-Changing Person Re-Identification Method Based on CLIP Enhanced Fine-Grained Features

GENG Xia, WANG Yao

Journal: DOAJ (Directory of Open Access Journals) Year: 2025
JOURNAL ARTICLE

TF-CLIP: Learning Text-Free CLIP for Video-Based Person Re-identification

Chenyang Yu, Xuehu Liu, Yingquan Wang, Pingping Zhang, Huchuan Lu

Journal: Proceedings of the AAAI Conference on Artificial Intelligence Year: 2024 Vol: 38 (7) Pages: 6764-6772