JOURNAL ARTICLE

Improving Image-Text Matching With Bidirectional Consistency of Cross-Modal Alignment

Zhe Li, Lei Zhang, Kun Zhang, Yongdong Zhang, Zhendong Mao

Year: 2024
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Vol: 34 (7)
Pages: 6590-6607
Publisher: Institute of Electrical and Electronics Engineers

Abstract

Image-text matching is a fundamental task in bridging the semantics between vision and language. The key challenge lies in establishing accurate alignment between two heterogeneous modalities. Existing cross-modal fine-grained matching methods normally include two alignment directions, "word to region" and "region to word", and the overall image-text similarity is calculated from these alignments. However, the two directions are typically aligned independently of each other, so consistency between the "word to region" and "region to word" alignments cannot be guaranteed. This inevitably introduces inconsistent alignments, which can lead to inaccurate image-text matching results. In this paper, we propose novel Bidirectional cOnsistency netwOrks for cross-Modal alignment (BOOM), which achieve more accurate cross-modal semantic alignments by imposing explicit consistency constraints in both directions. Specifically, according to three aspects reflected by alignment consistency, i.e., significance, wholeness, and orderliness, we design a systematic set of multi-granularity consistency constraints: point-wise consistency, which enforces consistency of the most significant single word item in the bidirectional alignments; set-wise consistency, which keeps the entire sets of bidirectional alignment values consistent, making the constraint more comprehensive and accurate; and order-wise consistency, which ensures that the order of the bidirectional alignment results is consistent. Bidirectional cross-modal alignment between words and regions is thus corrected from three different perspectives: maximum, distribution, and order. Extensive experiments on two benchmarks, i.e., Flickr30K and MS-COCO, demonstrate that BOOM achieves state-of-the-art performance.
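To make the three perspectives (maximum, distribution, and order) concrete, the following is a minimal illustrative sketch in plain Python. It is not the authors' formulation: the abstract does not give the loss functions, so the softmax-based alignments, the function names (bidirectional_alignments, pointwise_agreement, setwise_gap, orderwise_agreement), and the specific agreement measures are all assumptions chosen only to show how "word to region" and "region to word" alignments can disagree when computed independently.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(xs):
    return max(range(len(xs)), key=lambda k: xs[k])

def bidirectional_alignments(sim):
    """sim[i][j] is an assumed similarity score between word i and region j."""
    n_words, n_regions = len(sim), len(sim[0])
    # Word -> region: each word spreads attention over regions (row softmax).
    w2r = [softmax(row) for row in sim]
    # Region -> word: each region spreads attention over words (column softmax),
    # stored transposed so that r2w[i][j] lines up with w2r[i][j].
    cols = [softmax([sim[i][j] for i in range(n_words)]) for j in range(n_regions)]
    r2w = [[cols[j][i] for j in range(n_regions)] for i in range(n_words)]
    return w2r, r2w

def pointwise_agreement(w2r, r2w):
    """Maximum view: fraction of words whose top region also picks that word."""
    n_words = len(w2r)
    hits = 0
    for i in range(n_words):
        j = argmax(w2r[i])
        best_word = argmax([r2w[k][j] for k in range(n_words)])
        hits += (best_word == i)
    return hits / n_words

def setwise_gap(w2r, r2w):
    """Distribution view: mean absolute gap between the two alignment matrices."""
    diffs = [abs(a - b) for ra, rb in zip(w2r, r2w) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def orderwise_agreement(w2r, r2w):
    """Order view: fraction of region pairs ranked the same way in both directions."""
    same = total = 0
    for ra, rb in zip(w2r, r2w):
        for j in range(len(ra)):
            for k in range(j + 1, len(ra)):
                total += 1
                same += ((ra[j] - ra[k]) * (rb[j] - rb[k]) > 0)
    return same / total

# Hypothetical 2-word x 2-region similarity matrix (made-up numbers).
sim = [[2.0, 1.0],
       [3.0, 0.0]]
w2r, r2w = bidirectional_alignments(sim)
print(pointwise_agreement(w2r, r2w), setwise_gap(w2r, r2w), orderwise_agreement(w2r, r2w))
```

In this toy matrix both words prefer region 0, but region 0 prefers only the second word, so the two directions disagree for the first word. A consistency constraint of the kind the abstract describes would penalize such disagreement during training; the measures above only diagnose it.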

Keywords: Computer science; Modal; Artificial intelligence; Consistency (knowledge bases); Matching (statistics); Image matching; Computer vision; Image (mathematics); Pattern recognition (psychology); Mathematics; Statistics

Metrics

Cited By: 25
FWCI (Field Weighted Citation Impact): 13.25
Refs: 73
Citation Normalized Percentile: 0.98

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Image Retrieval and Classification Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
