Learning Fine-Grained Information Alignment for Calibrated Cross-Modal Retrieval

Jianhua Dong; Shengrong Zhao; Liang Hu

doi:10.1109/icassp48485.2024.10448127

JOURNAL ARTICLE

Year: 2024 Vol: 35 Pages: 8286-8290

Abstract

Masked Language Modeling (MLM) and Image-Text Matching (ITM) are commonly used in fusion encoders to learn joint representations of images and text. In existing methods, the masking strategy of MLM causes the model to neglect image details during modeling, while the sampling strategy of ITM struggles to consistently select sufficiently hard negative instances, which weakens the constraint it imposes. Together, these issues make it difficult to align fine-grained information in cross-modal retrieval. To address this, the paper proposes a fine-grained information alignment-based visual language model (FAM). On the one hand, an attribute-based masking strategy is employed in MLM, helping the model focus on the details of objects in images during modeling. On the other hand, a robust hard negative sample generation strategy provides challenging negative samples for ITM by altering the relationships between objects. This enables the model to align relationships between objects across modalities and thus calibrates cross-modal retrieval. Extensive experiments demonstrate the effectiveness of the model on cross-modal retrieval tasks.
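
As a rough illustration of the two ideas named in the abstract, the sketch below masks attribute words in a caption (so an MLM objective would have to recover them, plausibly from the image) and builds a hard negative caption by swapping the two objects around a relation. It is not the authors' implementation: the attribute vocabulary, the `[MASK]` token string, and the function names `attribute_mask` and `relation_swap_negative` are all assumptions made for this example, and the abstract does not say how FAM actually selects attributes or alters relations.

```python
# Hypothetical attribute vocabulary; the abstract does not specify how
# attribute words are identified, so this set is an assumption.
ATTRIBUTE_WORDS = {"red", "blue", "green", "small", "large", "wooden", "striped"}
MASK_TOKEN = "[MASK]"  # assumed mask token string


def attribute_mask(tokens, attribute_words=ATTRIBUTE_WORDS, mask_token=MASK_TOKEN):
    """Replace every attribute word with the mask token.

    Returns the masked token list and a dict mapping each masked
    position to its original token (the MLM prediction targets).
    """
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if tok.lower() in attribute_words:
            masked.append(mask_token)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


def relation_swap_negative(subject, relation, obj):
    """Build a negative caption by exchanging subject and object.

    The words are unchanged; only the relationship between the two
    objects is altered, which is what makes the negative hard.
    """
    return f"{obj} {relation} {subject}"


caption = "a small dog chases a red ball".split()
masked_caption, mlm_targets = attribute_mask(caption)
negative_caption = relation_swap_negative("a dog", "chases", "a ball")
```

Here `masked_caption` hides "small" and "red" while leaving the object nouns visible, and `negative_caption` reads "a ball chases a dog": it shares every word with the positive caption, so an ITM classifier can only reject it by attending to who does what to whom.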

Keywords:

Computer science; Focus (optics); Modal; Encoder; Process (computing); Masking (illustration); Artificial intelligence; Representation (politics); Matching (statistics); Sampling (signal processing); Machine learning; Information retrieval; Computer vision

Metrics

Cited By

FWCI (Field Weighted Citation Impact): 0.00

Refs

Citation Normalized Percentile: 0.03

Is in top 1%

Is in top 10%

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Implicit Fine-Grained Alignment for Cross-Modal Retrieval

Fine-Grained Alignment for Cross-Modal Recipe Retrieval

Learning Relation Alignment for Calibrated Cross-modal Retrieval
