Improving Image-Text Matching With Bidirectional Consistency of Cross-Modal Alignment

Zhe Li; Lei Zhang; Kun Zhang; Yongdong Zhang; Zhendong Mao

doi:10.1109/tcsvt.2024.3369656

JOURNAL ARTICLE

Year: 2024
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Vol: 34 (7)
Pages: 6590-6607
Publisher: Institute of Electrical and Electronics Engineers

Abstract

Image-text matching is a fundamental task in bridging the semantics between vision and language. The key challenge lies in establishing accurate alignment between two heterogeneous modalities. Existing fine-grained cross-modal matching methods normally use two alignment directions, "word to region" and "region to word", and compute the overall image-text similarity from these alignments. However, the two directions are typically aligned independently of each other, so consistency between them cannot be guaranteed. This inevitably introduces inconsistent alignments and can lead to inaccurate image-text matching results. In this paper, we propose a novel Bidirectional cOnsistency netwOrks for cross-Modal alignment (BOOM), which achieves more accurate cross-modal semantic alignments by imposing explicit consistency constraints in both directions. Specifically, according to three aspects reflected by alignment consistency, i.e., significance, wholeness, and alignment orderliness, we design a novel, systematic set of multi-granularity consistency constraints: point-wise consistency, which enforces consistency of the single most significant word item in the bidirectional alignments; set-wise consistency, which keeps the entire sets of bidirectional alignment values consistent, more comprehensively and accurately; and order-wise consistency, which ensures that the bidirectional alignment results are consistent in order. Bidirectional cross-modal alignment between words and regions is thus corrected from three different perspectives: maximum, distribution, and order. Extensive experiments on two benchmarks, i.e., Flickr30K and MS-COCO, demonstrate that our BOOM achieves state-of-the-art performance.
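The abstract does not give BOOM's formulas, so the following is only a minimal sketch of what "inconsistency between the two alignment directions" can look like, not the authors' method. It assumes a hypothetical word-region similarity matrix `sim`, softmax attention in each direction, and three illustrative gap measures (`pointwise_gap`, `setwise_gap`, `orderwise_gap`) that loosely mirror the maximum, distribution, and order perspectives. All function names and the choice of measures are assumptions made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def bidirectional_alignments(sim):
    """sim[i][j]: similarity of word i and region j (hypothetical input)."""
    n_words, n_regions = len(sim), len(sim[0])
    # Word to region: every word spreads its attention over the regions.
    w2r = [softmax(row) for row in sim]
    # Region to word: every region spreads its attention over the words.
    per_region = [softmax([sim[i][j] for i in range(n_words)])
                  for j in range(n_regions)]
    # Re-index as [word][region] so both directions can be compared entry-wise.
    r2w = [[per_region[j][i] for j in range(n_regions)]
           for i in range(n_words)]
    return w2r, r2w


def argmax(values):
    return max(range(len(values)), key=lambda k: values[k])


def pointwise_gap(w2r, r2w):
    """Share of words whose strongest region differs between directions."""
    misses = sum(argmax(a) != argmax(b) for a, b in zip(w2r, r2w))
    return misses / len(w2r)


def setwise_gap(w2r, r2w):
    """Mean absolute difference over all alignment values (illustrative)."""
    diffs = [abs(x - y) for a, b in zip(w2r, r2w) for x, y in zip(a, b)]
    return sum(diffs) / len(diffs)


def orderwise_gap(w2r, r2w):
    """Share of region pairs ranked in opposite order by the two directions."""
    discordant = total = 0
    for a, b in zip(w2r, r2w):
        for j in range(len(a)):
            for k in range(j + 1, len(a)):
                total += 1
                if (a[j] - a[k]) * (b[j] - b[k]) < 0:
                    discordant += 1
    return discordant / total if total else 0.0


# Toy similarities: word 0 prefers region 0, yet region 0 prefers word 1.
sim = [[2.0, 1.0],
       [3.0, 0.0]]
w2r, r2w = bidirectional_alignments(sim)
print(pointwise_gap(w2r, r2w), setwise_gap(w2r, r2w), orderwise_gap(w2r, r2w))
```

In this toy case the two directions are computed from the same similarities but normalized along different axes, so they can disagree on the top region and on the ranking for a word. A consistency constraint of the kind the abstract describes would penalize such gaps during training; how BOOM actually defines and weights them is specified in the paper, not here.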

Keywords:

Computer science; Modal; Artificial intelligence; Consistency (knowledge bases); Matching (statistics); Image matching; Computer vision; Image (mathematics); Pattern recognition (psychology); Mathematics; Statistics

Metrics

FWCI (Field Weighted Citation Impact): 13.25

Citation Normalized Percentile: 0.98

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Image Retrieval and Classification Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Conceptual and Syntactical Cross-modal Alignment with Cross-level Consistency for Image-Text Matching

Improving Cross-modal Alignment for Text-Guided Image Inpainting

Cross-Modal Image-Text Retrieval with Semantic Consistency

Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning

Image-Text Retrieval With Cross-Modal Semantic Importance Consistency