JOURNAL ARTICLE

Fine-Grained Alignment Network for Zero-Shot Cross-Modal Retrieval

Shiping Ge, Zhiwei Jiang, Yafeng Yin, Cong Wang, Zifeng Cheng, Qing Gu

Year: 2025
Journal: ACM Transactions on Multimedia Computing, Communications, and Applications
Vol: 21 (10)
Pages: 1-24
Publisher: Association for Computing Machinery

Abstract

Zero-Shot Cross-Modal Retrieval (ZS-CMR) aims to perform cross-modal retrieval on data of unseen classes, where a key challenge is addressing the modality-gap and domain-shift problems simultaneously. Existing methods tackle this challenge mainly through a sample-label alignment paradigm, which aligns samples of the same class but of different modalities with the word embedding of their class label. However, these methods focus only on class-level alignment and overlook the rich fine-grained semantic information in samples, resulting in a coarse understanding of sample matching and poor generalization to unseen classes. In this article, we propose a novel Fine-Grained Alignment Network, an end-to-end framework that learns representations with two fine-grained alignment strategies, yielding a representation space that generalizes better to unseen classes. Specifically, we extract two kinds of fine-grained representations, region embeddings and label distributions, from the feature and label aspects, respectively. To optimize the region embeddings, we propose a Fine-Grained Contrastive Learning (FGCL) strategy that simultaneously conducts class-level alignment and models intra-class discrepancy. To optimize the label distributions, we propose a Fine-Grained Label Alignment (FGLA) strategy that aligns diverse fine-grained semantic information among samples, rather than merely label information. Finally, region embeddings and label distributions are used together to perform ZS-CMR at a finer granularity. Experimental results on three widely used datasets demonstrate that our method outperforms state-of-the-art methods by a large margin. Detailed ablation studies further confirm the advantage of each proposed component. Our code will be available at https://github.com/ShipingGe/FGAN.
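For readers unfamiliar with the sample-label alignment paradigm that the abstract attributes to existing methods, the following is a minimal sketch of that baseline idea only. It is not the authors' Fine-Grained Alignment Network. The function names, the cosine-based loss, and the two-dimensional toy vectors are all assumptions made for illustration; real systems learn the embeddings with neural encoders and use pretrained word embeddings for the class labels.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def label_alignment_loss(embeddings, labels, label_vectors):
    """Mean (1 - cosine) between each sample and the word embedding of its class label.

    Hypothetical loss for illustration: it pulls every sample, whatever its
    modality, toward a single vector per class (class-level alignment only).
    """
    losses = [1.0 - cosine(e, label_vectors[y]) for e, y in zip(embeddings, labels)]
    return sum(losses) / len(losses)


def retrieve(query, gallery):
    """Return gallery indices sorted by descending cosine similarity to the query."""
    return sorted(range(len(gallery)), key=lambda i: cosine(query, gallery[i]), reverse=True)


# Toy, made-up data: two classes, one image and one text sample per class.
label_vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}  # stand-ins for word embeddings
image_embeddings = [[0.9, 0.1], [0.2, 0.8]]
image_labels = ["cat", "dog"]
text_embeddings = [[0.1, 0.9], [0.8, 0.2]]  # a "dog" text, then a "cat" text

loss = label_alignment_loss(image_embeddings, image_labels, label_vectors)
ranking = retrieve(image_embeddings[0], text_embeddings)  # image-to-text retrieval
print(loss, ranking)
```

Because every sample of a class is pulled toward the same label vector, a loss of this kind cannot distinguish samples within a class, which is the limitation the abstract says the proposed fine-grained strategies are meant to address.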

Keywords:
Computer science, Modal, Shot (pellet), Zero (linguistics), Artificial intelligence, Information retrieval

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 35
Citation Normalized Percentile: 0.05

Topics

Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence
Advanced Neural Network Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition