JOURNAL ARTICLE

Text-and-Image Learning Transformer for Cross-Modal Person Re-Identification

Tinghui Wu, Shuhe Zhang, Dihu Chen, Haifeng Hu

Year: 2024 Journal: ACM Transactions on Multimedia Computing, Communications, and Applications Vol: 21 (1) Pages: 1-18 Publisher: Association for Computing Machinery

Abstract

Text-based person re-identification aims to retrieve a target person from a large pedestrian gallery given a natural language description. Previous works mainly focus on embedding salient textual and visual representations in a common latent space using a dual-path structure or a parameter-shared network. However, they still struggle to extract fine-grained unimodal features and to fuse cross-modal data effectively, which increases the number of misaligned cases. To address these issues, we propose a text-and-image implicit learning Transformer (TILT) that eliminates textual anisotropy and enhances cross-modal alignment from both domains based on bi-directional multi-modal encoders. Specifically, we apply a pre-trained multi-modal embedding module to overcome the unimodal anisotropy problem with contrastive learning, and map fine-grained features with a dual encoder under bi-directional masking. Then, we design a cross-modal interaction encoder to comprehensively mine implicit cross-modal relations by reconstructing masked tokens, and to fuse rich multi-modal knowledge in a common space. In addition, a cross-modal similarity matching module is proposed to optimize intra-domain classification and decrease inter-domain divergence. Extensive experiments are conducted on three public benchmarks, CUHK-PEDES, ICFG-PEDES, and RSTPReid, to verify the effectiveness of the proposed framework. Results show that our model outperforms state-of-the-art methods on all metrics.
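
As a rough illustration of two ideas the abstract mentions, retrieval in a common latent space and contrastive alignment of text and image features, the following is a minimal sketch and not the authors' TILT implementation. It assumes embeddings are plain lists of floats that are already produced by some encoder and are non-zero; the function names and the `temperature` default are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_gallery(text_emb, gallery_embs):
    """Return gallery indices ordered from most to least similar to the text query."""
    scores = [cosine(text_emb, g) for g in gallery_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def symmetric_contrastive_loss(text_embs, image_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss for a batch where text i matches image i.

    The temperature value is an illustrative assumption, not taken from the paper.
    """
    n = len(text_embs)
    logits = [[cosine(t, v) / temperature for v in image_embs] for t in text_embs]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]
        return total / n

    columns = [list(col) for col in zip(*logits)]  # image-to-text direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(columns))
```

With such a loss, matched text-image pairs are pulled together and mismatched pairs pushed apart, so that `rank_gallery` tends to place the described person near the top of the list. The masked-token reconstruction and similarity matching modules described in the abstract are not covered by this sketch.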

Keywords:
Computer science; Transformer; Modal; Artificial intelligence; Computer vision; Identification (biology); Human–computer interaction; Natural language processing; Machine learning; Voltage; Electrical engineering

Metrics

Cited By: 1
FWCI (Field Weighted Citation Impact): 0.53
Refs: 63
Citation Normalized Percentile: 0.58

Topics

Video Surveillance and Tracking Methods
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Face Recognition and Analysis
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Gait Recognition and Analysis
Physical Sciences → Engineering → Biomedical Engineering
