JOURNAL ARTICLE

Hierarchical Attention Image-Text Alignment Network For Person Re-Identification

Abstract

Description-based person re-identification (Re-ID) is a crucial cross-modality task that aims to retrieve a specific person given a textual description. Existing description-based Re-ID methods focus on learning robust representations to measure the similarity between the global features of the two modalities. However, such global mapping disregards semantic consistencies between local visual and linguistic features. Two further challenges remain: alignment uncertainty, which arises from poor correspondence between text-image pairs, and text complexity, which arises from irrelevant words. To address these, we propose an end-to-end Hierarchical Attention Image-Text Alignment Network, named HAITA-Net. Our model comprises: i) a hierarchical attention alignment network that determines the potential relationships between image content and textual information at three levels (word-patch, phrase-patch, and sentence-image) to address alignment uncertainty; and ii) a new Term Frequency-Inverse Document Frequency (TF-IDF) thresholding strategy that extracts salient tokens to alleviate text complexity. The network is optimized end to end via a joint weighted hierarchical attention loss and a cross-modal loss. Extensive experiments demonstrate the effectiveness of our method.
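
The TF-IDF thresholding idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact TF-IDF variant, the tokenizer, or the threshold value, so everything below is an assumption made for illustration: whitespace tokenization, a smoothed IDF of log((1 + N) / (1 + df)), a hand-picked threshold of 0.1, and the hypothetical function name `tfidf_salient_tokens`. It is not the authors' implementation.

```python
import math
from collections import Counter


def tfidf_salient_tokens(descriptions, threshold):
    """Keep, for each description, the tokens whose TF-IDF score >= threshold.

    Illustrative only: tokenization, IDF smoothing, and the threshold are
    assumptions, not details taken from the paper.
    """
    docs = [d.lower().split() for d in descriptions]
    n_docs = len(docs)
    # Document frequency: number of descriptions that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))

    salient = []
    for doc in docs:
        tf = Counter(doc)
        kept = []
        for tok in doc:
            # Smoothed IDF (assumed variant): tokens present in every
            # description get an IDF of 0 and are always filtered out
            # for any positive threshold.
            idf = math.log((1 + n_docs) / (1 + df[tok]))
            score = (tf[tok] / len(doc)) * idf
            if score >= threshold:
                kept.append(tok)
        salient.append(kept)
    return salient


descriptions = [
    "the man wears a red jacket",
    "the woman wears a blue dress",
    "the man carries a black backpack",
]
print(tfidf_salient_tokens(descriptions, threshold=0.1))
# [['red', 'jacket'], ['woman', 'blue', 'dress'], ['carries', 'black', 'backpack']]
```

In this toy corpus, words shared by every description ("the", "a") score zero and words shared by two descriptions ("man", "wears") fall below the threshold, so only the distinguishing attribute words survive. In practice the threshold would need to be tuned on the target dataset.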

Keywords:
Computer science; Artificial intelligence; Identification (biology); Similarity (geometry); Natural language processing; Sentence; Pattern recognition (psychology); Focus (optics); Phrase; Modality (human–computer interaction); Salient; Attention network; Image (mathematics)

Metrics

Cited By: 4
FWCI (Field Weighted Citation Impact): 0.41
Refs: 38
Citation Normalized Percentile: 0.60

Topics

Video Surveillance and Tracking Methods
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Cross-Modal Alignment Enhancement Network for Text-to-Image Person Re-Identification

Di He, Xinshan Zhu, Bin Li, Shenglu Yue, Zhong Zhang

Journal: IEEE Internet of Things Journal, Year: 2025, Vol: 12 (24), Pages: 55046-55060

JOURNAL ARTICLE

Cross-modal feature learning and alignment network for text–image person re-identification

Bailiang Huang, Xiaolong Qi, Bin Chen

Journal: Journal of Visual Communication and Image Representation, Year: 2024, Vol: 103, Pages: 104219

JOURNAL ARTICLE

Implicit Alignment-Based Cross-Modal Symbiotic Network for Text-to-Image Person Re-Identification

Rui Sun, Yun Du, Guoxi Huang, X. D. Wang, Jingjing Wu

Journal: IEEE Transactions on Information Forensics and Security, Year: 2025, Vol: 20, Pages: 8069-8082

JOURNAL ARTICLE

Fine-grained alignment network and local attention network for person re-identification

Dongming Zhou, Canlong Zhang, Yanping Tang, Zhixin Li

Journal: Multimedia Tools and Applications, Year: 2022, Vol: 81 (30), Pages: 43267-43281