JOURNAL ARTICLE

Learning Fine-Grained Information Alignment for Calibrated Cross-Modal Retrieval

Abstract

Masked Language Modeling (MLM) and Image-Text Matching (ITM) are commonly used in fusion encoders to learn joint representations of images and text. In existing methods, the masking strategy of MLM causes the model to neglect image details. Meanwhile, the sampling strategy of ITM does not consistently select sufficiently hard negative instances, which weakens the constraint it imposes. Together, these issues make it difficult to align fine-grained information in cross-modal retrieval. To address this, this paper proposes a fine-grained information alignment-based visual language model (FAM). First, an attribute-based masking strategy is employed in MLM, which helps the model focus on the details of objects in images. Second, a robust hard negative sample generation strategy provides challenging negative samples for ITM by altering the relationships between objects. This enables the model to align relationships between objects across modalities and thus calibrates cross-modal retrieval. Extensive experiments demonstrate the effectiveness of the model on cross-modal retrieval tasks.
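
As a rough illustration of the two ideas named in the abstract, the sketch below masks attribute words in a caption and builds a negative caption by exchanging the two objects of a relation. It is not the FAM implementation: the abstract does not say how attributes are identified or how relationships are altered, so the attribute vocabulary, the function names `mask_attributes` and `swap_relation_negative`, and the swap-based negative are all assumptions made for this example.

```python
# Hypothetical attribute vocabulary (an assumption for illustration only;
# the source does not specify how attribute words are identified).
ATTRIBUTE_WORDS = {"red", "blue", "small", "large", "wooden", "striped"}


def mask_attributes(tokens, attribute_words=ATTRIBUTE_WORDS, mask_token="[MASK]"):
    """Replace attribute tokens with a mask token.

    Returns the masked token list and a parallel label list that holds the
    original token at masked positions and None elsewhere.
    """
    masked, labels = [], []
    for tok in tokens:
        if tok.lower() in attribute_words:
            masked.append(mask_token)
            labels.append(tok)   # target to recover, ideally from the image
        else:
            masked.append(tok)
            labels.append(None)  # position is not scored
    return masked, labels


def swap_relation_negative(tokens, subject_idx, object_idx):
    """Build a negative caption by exchanging the subject and object tokens.

    Swapping is one simple, assumed way to alter the relationship between
    objects while keeping the same words in the caption.
    """
    negative = list(tokens)
    negative[subject_idx], negative[object_idx] = (
        negative[object_idx],
        negative[subject_idx],
    )
    return negative


caption = "a small dog chases a red ball".split()
masked, labels = mask_attributes(caption)
negative = swap_relation_negative(caption, subject_idx=2, object_idx=6)
print(" ".join(masked))
print(" ".join(negative))
```

In this toy setup the masked words can only be recovered from the image, and the negative caption shares every word with the original, so an ITM classifier has to attend to who does what to whom rather than to which objects are present.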

Keywords:
Computer science; Focus (optics); Modal; Encoder; Process (computing); Masking (illustration); Artificial intelligence; Representation (politics); Matching (statistics); Sampling (signal processing); Machine learning; Information retrieval; Computer vision

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 24
Citation Normalized Percentile: 0.03

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition