JOURNAL ARTICLE

Cross-Modality Spatial-Temporal Transformer for Video-Based Visible-Infrared Person Re-Identification

Yujian Feng, Feng Chen, Jian Yu, Yimu Ji, Fei Wu, Tianliang Liu, Shangdong Liu, Xiao-Yuan Jing, Jiebo Luo

Year: 2024
Journal: IEEE Transactions on Multimedia
Vol: 26
Pages: 6582-6594
Publisher: Institute of Electrical and Electronics Engineers

Abstract

Video-based visible-infrared person re-identification (VVI-ReID) aims to match the identity of a person captured in video sequences from both visible and infrared cameras. The VVI-ReID task requires considering both the spatial relationship between body parts within each frame and the temporal change of appearance between successive frames. Existing VVI-ReID methods employ Convolutional Neural Networks to extract local spatial features and Long Short-Term Memory to form temporal associations. However, these methods cannot effectively capture global spatial features or long-range temporal dependencies in ultra-long sequences. In this paper, we propose a Cross-modality Spatial-temporal Transformer (CST), comprising a Cross-frame Tube Transformer Module (CTTM) and a Multi-frame Transformer Fusion Module (MTFM), to address these challenges. First, CTTM tokenizes a video clip into multiple 3D tubes, each encapsulating local spatial-temporal information about pedestrians, and then obtains global spatial-temporal representations by establishing relationships between tubes. Second, we design MTFM to exchange information between multiple frames using message tokens, thus modeling the long-range temporal dependencies of pedestrian features. In addition, to prevent the potential representation collapse caused by triplet-based loss functions, we propose a diversity-consistency (DC) loss function that preserves the diversity and consistency of cross-modality feature representations by imposing variance, invariance, and covariance constraints on them. Extensive benchmark experiments demonstrate that our approach outperforms state-of-the-art methods by large margins.
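The tube tokenization step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `tubes`, the nested-list clip layout, and the choice of non-overlapping tubes whose sizes divide the clip dimensions evenly are all assumptions made for illustration. The abstract does not specify tube size, stride, or how tubes are embedded.

```python
from itertools import product


def tubes(clip, tube_t, tube_h, tube_w):
    """Split a clip into flattened, non-overlapping 3D tubes.

    `clip` is a nested list indexed as clip[frame][row][col].
    Hypothetical helper: non-overlapping tubes and evenly divisible
    dimensions are assumptions of this sketch, not details from the paper.
    """
    n_frames, height, width = len(clip), len(clip[0]), len(clip[0][0])
    if n_frames % tube_t or height % tube_h or width % tube_w:
        raise ValueError("clip dimensions must be divisible by the tube size")

    tokens = []
    # Visit tube origins in (time, row, col) order.
    for t0, y0, x0 in product(range(0, n_frames, tube_t),
                              range(0, height, tube_h),
                              range(0, width, tube_w)):
        # Flatten every value inside the tube into a single token, so one
        # token carries both spatial and temporal context.
        token = [clip[t][y][x]
                 for t in range(t0, t0 + tube_t)
                 for y in range(y0, y0 + tube_h)
                 for x in range(x0, x0 + tube_w)]
        tokens.append(token)
    return tokens


# Toy clip: 4 frames of 4x4 "pixels", each value encoding its own position.
clip = [[[t * 16 + y * 4 + x for x in range(4)] for y in range(4)]
        for t in range(4)]
tokens = tubes(clip, tube_t=2, tube_h=2, tube_w=2)
```

In a transformer pipeline each flattened tube would typically be projected to an embedding and passed through self-attention across all tubes, which is one way to relate local spatial-temporal tokens globally; the projection and attention layers are omitted here.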

Keywords:
Computer science, Artificial intelligence, Pattern recognition (psychology), Computer vision, Transformer, Feature learning, Mutual information, Spatial analysis, Mathematics

Metrics

Cited By: 11
FWCI (Field Weighted Citation Impact): 5.83
Refs: 64
Citation Normalized Percentile: 0.93

Topics

Video Surveillance and Tracking Methods
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Gait Recognition and Analysis
Physical Sciences → Engineering → Biomedical Engineering
Human Pose and Action Recognition
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition