Text-and-Image Learning Transformer for Cross-Modal Person Re-Identification

Tinghui Wu; Shuhe Zhang; Dihu Chen; Haifeng Hu

doi:10.1145/3686160

JOURNAL ARTICLE

Year: 2024
Journal: ACM Transactions on Multimedia Computing, Communications, and Applications
Vol: 21 (1)
Pages: 1-18
Publisher: Association for Computing Machinery

Abstract

Text-based person re-identification aims to find a target person in a large pedestrian gallery given a natural language description. Previous works mainly focus on embedding salient textual and visual representations in a common latent space using a dual-path structure or a parameter-shared network. However, they still struggle to extract fine-grained unimodal features and to fuse cross-modal data effectively, which leads to more misaligned cases. To address these issues, we propose a text-and-image implicit learning Transformer (TILT) that eliminates textual anisotropy and enhances cross-modal alignment from both domains using bidirectional multi-modal encoders. Specifically, we apply a pre-trained multi-modal embedding module to overcome the unimodal anisotropy problem with contrastive learning, and map fine-grained features with a dual encoder under bidirectional masking. We then design a cross-modal interaction encoder that mines implicit cross-modal relations by reconstructing masked tokens and fuses rich multi-modal knowledge in a common space. In addition, a cross-modal similarity matching module is proposed to optimize intra-domain classification and decrease inter-domain divergence. Extensive experiments on three public benchmarks, CUHK-PEDES, ICFG-PEDES, and RSTPReid, verify the effectiveness of the proposed framework. Results show that our model outperforms state-of-the-art methods on all metrics.
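
The abstract names contrastive learning in a common space and text-to-gallery retrieval but gives no formulas. As a rough illustration only, the sketch below shows a generic symmetric contrastive (InfoNCE-style) loss over paired text and image embeddings, plus nearest-neighbour retrieval by cosine similarity. It is not the TILT implementation: the function names, the temperature of 0.07, and the toy 2-D embeddings are all assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(text_embs, image_embs):
    """Cosine similarity between every text (rows) and every image (columns)."""
    texts = [l2_normalize(v) for v in text_embs]
    images = [l2_normalize(v) for v in image_embs]
    return [[sum(a * b for a, b in zip(t, g)) for g in images] for t in texts]


def mean_cross_entropy(logits, targets):
    """Average softmax cross-entropy, using the log-sum-exp trick for stability."""
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[target]
    return total / len(logits)


def symmetric_contrastive_loss(text_embs, image_embs, temperature=0.07):
    """Assumed generic loss: pair i of text and image is the only positive.

    The temperature value is an illustrative assumption, not taken from the paper.
    """
    sim = similarity_matrix(text_embs, image_embs)
    text_to_image = [[s / temperature for s in row] for row in sim]
    image_to_text = [list(col) for col in zip(*text_to_image)]
    targets = list(range(len(sim)))
    return 0.5 * (
        mean_cross_entropy(text_to_image, targets)
        + mean_cross_entropy(image_to_text, targets)
    )


def rank1_matches(text_embs, image_embs):
    """For each text query, return the index of the most similar gallery image."""
    sim = similarity_matrix(text_embs, image_embs)
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


# Toy data (hypothetical): two descriptions and two gallery images in 2-D.
texts = [[1.0, 0.0], [0.0, 1.0]]
images = [[0.9, 0.1], [0.2, 0.8]]
matches = rank1_matches(texts, images)
aligned_loss = symmetric_contrastive_loss(texts, images)
swapped_loss = symmetric_contrastive_loss(texts, images[::-1])
```

In this toy setup the loss is lower when each description is paired with its own image than when the gallery order is swapped, which is the behaviour a contrastive objective is meant to reward.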

Keywords: Computer science; Transformer; Modal; Artificial intelligence; Computer vision; Identification (biology); Human–computer interaction; Natural language processing; Machine learning; Voltage; Electrical engineering

Metrics

FWCI (Field Weighted Citation Impact): 0.53

Citation Normalized Percentile: 0.58

Topics

Video Surveillance and Tracking Methods

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Face recognition and analysis

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Gait Recognition and Analysis

Physical Sciences → Engineering → Biomedical Engineering

Related Documents

Transformer network for cross-modal text-to-image person re-identification

Adaptive Contrastive Cross-Modal Transformer for Enhanced Text-to-Image Person Re-Identification

Image–Text Person Re-Identification with Transformer-Based Modal Fusion

Cross-modal feature learning and alignment network for text–image person re-identification

Cross-Modal Alignment Enhancement Network for Text-to-Image Person Re-Identification