JOURNAL ARTICLE

Enhanced Semantic Similarity Learning Framework for Image-Text Matching

Kun Zhang, Bo Hu, Huatian Zhang, Zhe Li, Zhendong Mao

Year: 2023
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Vol: 34 (4)
Pages: 2973-2988
Publisher: Institute of Electrical and Electronics Engineers

Abstract

Image-text matching is a fundamental task for bridging vision and language. The critical challenge lies in accurately learning the semantic similarity between these two heterogeneous modalities. For visual and textual features, existing methods typically default to a static dimensional correspondence mechanism, i.e., they use a single dimension as the measure-unit and perform one-to-one correspondence to examine semantic similarity, e.g., the cosine/Euclidean distance or the weighted similarity. In this paper, in contrast to single-dimensional correspondence with its limited semantic expressive capability, we propose a novel enhanced semantic similarity learning (ESL) framework, which generalizes both measure-units and their correspondences into a dynamic learnable framework to examine the multi-dimensional enhanced correspondence between visual and textual features. Specifically, we first devise intra-modal multi-dimensional aggregators with an iterative enhancing mechanism, which dynamically capture new measure-units integrated from hierarchical multi-dimensions, producing diverse semantic combinatorial expressive capabilities that provide richer and more discriminative information for similarity examination. Then, we devise inter-modal enhanced correspondence learning with sparse contribution degrees, which comprehensively and efficiently determines the cross-modal semantic similarity. Extensive experiments verify its superiority in achieving state-of-the-art performance. Code will be released at https://github.com/CrossmodalGroup/ESL.
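To make the abstract's notion of a "single dimension as the measure-unit" concrete, the following minimal sketch shows how cosine similarity and a weighted similarity both reduce to a sum of per-dimension, one-to-one contributions. It is an illustration only, not the authors' ESL implementation. The function names, the fixed `weights`, and the mean-pooling `pooled_units` helper are hypothetical stand-ins chosen for this example; in particular, the paper describes learned, dynamic aggregators, whereas the pooling here is a fixed, hand-written substitute.

```python
import math


def _l2_normalize(vec):
    """Return vec scaled to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(u, v):
    """Cosine similarity as a sum of one-to-one per-dimension products.

    Dimension i of u is only ever compared with dimension i of v,
    which is the static single-dimension correspondence the abstract
    attributes to existing methods.
    """
    u_hat, v_hat = _l2_normalize(u), _l2_normalize(v)
    return sum(a * b for a, b in zip(u_hat, v_hat))


def weighted_similarity(u, v, weights):
    """Same one-to-one correspondence, with a weight per dimension.

    `weights` is a hypothetical fixed list; a real model would learn it.
    """
    u_hat, v_hat = _l2_normalize(u), _l2_normalize(v)
    return sum(w * a * b for w, a, b in zip(weights, u_hat, v_hat))


def pooled_units(vec, block):
    """Average non-overlapping blocks of `block` dimensions.

    Hypothetical stand-in for a multi-dimensional measure-unit:
    several dimensions are combined before any comparison is made.
    """
    chunks = [vec[i:i + block] for i in range(0, len(vec), block)]
    return [sum(chunk) / len(chunk) for chunk in chunks]


if __name__ == "__main__":
    image_feat = [0.2, 0.9, 0.1, 0.4]  # made-up feature values
    text_feat = [0.3, 0.8, 0.0, 0.5]   # made-up feature values

    print(cosine_similarity(image_feat, text_feat))
    print(weighted_similarity(image_feat, text_feat, [1.0, 0.5, 0.5, 1.0]))
    # Compare on aggregated two-dimension units instead of single dimensions.
    print(cosine_similarity(pooled_units(image_feat, 2),
                            pooled_units(text_feat, 2)))
```

With all weights set to 1, `weighted_similarity` coincides with `cosine_similarity`, which shows that both measures share the same per-dimension correspondence and differ only in how each dimension's contribution is scaled.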

Keywords:
Computer science; Artificial intelligence; Semantic similarity; Image (mathematics); Similarity (geometry); Matching (statistics); Natural language processing; Pattern recognition (psychology); Semantics (computer science); Information retrieval; Mathematics

Metrics

Cited By: 21
FWCI (Field Weighted Citation Impact): 3.82
Refs: 78
Citation Normalized Percentile: 0.92

Topics

Multimodal Machine Learning Applications
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Topic Modeling
Physical Sciences → Computer Science → Artificial Intelligence
Advanced Image and Video Retrieval Techniques
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition