Cross-Modality Spatial-Temporal Transformer for Video-Based Visible-Infrared Person Re-Identification

Yujian Feng; Feng Chen; Jian Yu; Yimu Ji; Fei Wu; Tianliang Liu; Shangdong Liu; Xiao‐Yuan Jing; Jiebo Luo

doi:10.1109/tmm.2024.3354575

JOURNAL ARTICLE

Year: 2024; Journal: IEEE Transactions on Multimedia; Vol: 26; Pages: 6582-6594; Publisher: Institute of Electrical and Electronics Engineers

Abstract

Video-based visible-infrared person re-identification (VVI-ReID) aims to match the identity of a person captured in video sequences from both visible and infrared cameras. The VVI-ReID task requires considering both the spatial relationship between body parts within each frame and the temporal change of appearance between successive frames. Existing VVI-ReID methods employ Convolutional Neural Networks to extract local spatial features and Long Short-Term Memory to form temporal associations. However, these methods cannot effectively capture global spatial features or long-range temporal dependencies in ultra-long sequences. In this paper, we propose a Cross-modality Spatial-temporal Transformer (CST), comprising a Cross-frame Tube Transformer Module (CTTM) and a Multi-frame Transformer Fusion Module (MTFM), to address these challenges. First, CTTM tokenizes a video clip into multiple 3D tubes, each encapsulating local spatial-temporal information about pedestrians, and then obtains global spatial-temporal representations by establishing the relationships between tubes. Second, we design MTFM to exchange information between multiple frames using message tokens, thus modeling the long-range temporal dependencies of pedestrian features. In addition, to prevent the potential representation collapse caused by triplet-based loss functions, we propose a diversity-consistency (DC) loss function that preserves the diversity and consistency of cross-modality feature representations by imposing variance, invariance, and covariance constraints on the feature representations. Extensive benchmark experiments demonstrate that our approach outperforms state-of-the-art methods by large margins.
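The tube tokenization step described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function name `tubes_from_clip`, the single-channel nested-list clip layout (frames x rows x columns), and the non-overlapping tube grid are all assumptions made for illustration. A real model would presumably operate on multi-channel tensors and follow the flattening with a learned linear projection before the transformer layers.

```python
from itertools import product


def tubes_from_clip(clip, t, h, w):
    """Split a clip into non-overlapping 3D tubes of size t x h x w.

    `clip` is a nested list indexed as clip[frame][row][col] (hypothetical
    single-channel layout). Each tube is flattened into one token, ordered
    frame-major, then row, then column.
    """
    n_frames, n_rows, n_cols = len(clip), len(clip[0]), len(clip[0][0])
    if n_frames % t or n_rows % h or n_cols % w:
        raise ValueError("clip dimensions must be divisible by the tube size")

    tokens = []
    # Walk the grid of tube origins: time first, then rows, then columns.
    for t0, y0, x0 in product(range(0, n_frames, t),
                              range(0, n_rows, h),
                              range(0, n_cols, w)):
        tube = [clip[f][y][x]
                for f in range(t0, t0 + t)
                for y in range(y0, y0 + h)
                for x in range(x0, x0 + w)]
        tokens.append(tube)
    return tokens


# Usage: a 2-frame, 4x4 clip whose pixel value encodes its own position.
clip = [[[f * 100 + y * 10 + x for x in range(4)] for y in range(4)]
        for f in range(2)]
tokens = tubes_from_clip(clip, t=2, h=2, w=2)
print(len(tokens), len(tokens[0]))
```

Because each token spans several frames as well as a spatial patch, attention between tokens relates regions across both space and time, which is the property the abstract attributes to CTTM.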

Keywords: Computer science; Artificial intelligence; Pattern recognition (psychology); Computer vision; Transformer; Feature learning; Mutual information; Spatial analysis; Mathematics

Metrics

FWCI (Field Weighted Citation Impact): 5.83

Citation Normalized Percentile: 0.93

Topics

Video Surveillance and Tracking Methods

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Gait Recognition and Analysis

Physical Sciences → Engineering → Biomedical Engineering

Human Pose and Action Recognition

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Cross-Modality Transformer for Visible-Infrared Person Re-Identification

Nystromformer based cross-modality transformer for visible-infrared person re-identification

Visible-Infrared Person Re-Identification via Cross-Modality Interaction Transformer

Cross-Modality Transformer With Modality Mining for Visible-Infrared Person Re-Identification

PDET: Progressive Diversity Expansion Transformer for Cross-Modality Visible-Infrared Person Re-identification