JOURNAL ARTICLE

Dual-Stream Transformer With Distribution Alignment for Visible-Infrared Person Re-Identification

Zehua Chai, Yongguo Ling, Zhiming Luo, Dazhen Lin, Min Jiang, Shaozi Li

Year: 2023   Journal: IEEE Transactions on Circuits and Systems for Video Technology   Vol: 33 (11)   Pages: 6764-6776   Publisher: Institute of Electrical and Electronics Engineers

Abstract

Visible-infrared person re-identification (VI-ReID) aims to match person images captured by visible and infrared cameras, and it suffers from severe cross-modality discrepancy and intra-modality variations. Existing approaches mainly use convolutional neural network (CNN)-based architectures to extract pedestrian features, which fail to capture the long-range dependencies within an image. In addition, previous works usually attempt to bridge the modality gap either by using adversarial learning to generate style-consistent images or by designing feature-level metric learning constraints. However, few works consider the cross-modality disparity from the perspective of the overall distance distribution discrepancy. To address these problems, we design a pure Transformer-based Visible-Infrared (TransVI) network with a conventional two-stream structure, which can explicitly capture modality-specific representations and learn multi-modality sharable knowledge. The multi-head self-attention modules in the transformer allow TransVI to capture the long-range dependencies of pedestrian images, addressing the lack of global dependency modeling in CNN-based architectures. Furthermore, we introduce the Cross-Modality Dissimilarity-based Maximum Mean Discrepancy (CMD-MMD) constraint to handle the cross-modality discrepancy at the distance distribution level. Specifically, CMD-MMD leverages intra-modality distribution separability to guide inter-modality distribution separability learning, aligning the pair-wise distance distributions of intra- and inter-modality pairs for within-class and between-class pairs, respectively. In this way, the intra- and inter-modality distance distributions become more similar, which significantly mitigates the cross-modality discrepancy and yields more modality-invariant representations. Extensive experimental results on two public VI-ReID datasets confirm that our proposed framework can achieve state-of-the-art performance.
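
The abstract does not give the exact CMD-MMD formulation, so the following is only a minimal sketch of the general idea, under stated assumptions: pair-wise Euclidean distances, a Gaussian kernel with a hypothetical bandwidth `sigma`, and the standard biased estimator of squared MMD. All function names (`pairwise_dist`, `rbf_mmd2`, `split_by_class`, `cmd_mmd_sketch`) are illustrative and not taken from the paper. It compares the intra-modality and inter-modality distance distributions separately for within-class and between-class pairs.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def rbf_mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD between two 1-D samples (Gaussian kernel)."""
    def kernel(u, v):
        return np.exp(-((u[:, None] - v[None, :]) ** 2) / (2.0 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()


def split_by_class(dist, labels_a, labels_b, drop_self_pairs=False):
    """Split a distance matrix into within-class and between-class entries."""
    same = labels_a[:, None] == labels_b[None, :]
    keep = np.ones(dist.shape, dtype=bool)
    if drop_self_pairs:
        # Only used for same-modality matrices, where the diagonal
        # pairs each sample with itself (distance 0).
        np.fill_diagonal(keep, False)
    return dist[same & keep], dist[~same & keep]


def cmd_mmd_sketch(vis, ir, vis_labels, ir_labels, sigma=1.0):
    """Assumed form of the constraint: MMD between intra- and inter-modality
    distance distributions, summed over within-class and between-class pairs."""
    vv_pos, vv_neg = split_by_class(pairwise_dist(vis, vis), vis_labels, vis_labels, True)
    ii_pos, ii_neg = split_by_class(pairwise_dist(ir, ir), ir_labels, ir_labels, True)
    vi_pos, vi_neg = split_by_class(pairwise_dist(vis, ir), vis_labels, ir_labels)

    intra_pos = np.concatenate([vv_pos, ii_pos])  # within-class, same modality
    intra_neg = np.concatenate([vv_neg, ii_neg])  # between-class, same modality
    return rbf_mmd2(intra_pos, vi_pos, sigma) + rbf_mmd2(intra_neg, vi_neg, sigma)
```

With `vis` and `ir` as `(N, D)` feature arrays and integer identity labels, `cmd_mmd_sketch(vis, ir, vis_labels, ir_labels)` returns a scalar that is small when cross-modality distances are distributed like same-modality distances and grows when the two modalities drift apart. In training, such a term would presumably be computed on differentiable tensors and added to the identification losses; that integration is not covered by this sketch.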

Keywords:
Artificial intelligence; Computer science; Modality (human–computer interaction); Pattern recognition (psychology); Computer vision; Convolutional neural network; Deep learning

Metrics

Cited By: 40
FWCI (Field Weighted Citation Impact): 7.28
Refs: 81
Citation Normalized Percentile: 0.97

Topics

Video Surveillance and Tracking Methods (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Advanced Neural Network Applications (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Image Enhancement Techniques (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)