Yangtao Wang, Weibin Huang, Yanzhao Xie, Siyuan Chen, Weilong Peng, Maobin Tang, Meie Fang, Wensheng Zhang
Recently, most image-text matching (ITM) approaches have adopted a dual-stream transformer architecture to learn and align cross-modal semantic information. Although this design narrows the semantic gap between images and texts, it has two main limitations. First, it struggles to discriminate nuanced similarities among features, which can produce misleading matches or even compromise the overall ITM process. Second, the conventional triplet training paradigm relies on a pre-determined, fixed margin coefficient, which prevents it from accurately capturing the similarity relationships between positive and negative samples. In this paper, we propose high feature Distinguishability for Adaptive Image-text Matching with dual-stream transformers (termed DAIM). To address the first limitation, we design a feature discriminability module that pulls similar features closer together, while preserving a degree of distinction between them, and pushes dissimilar features farther apart, yielding the high feature distinguishability required for accurate ITM. To address the second limitation, we devise a margin optimization module that perceives the similarity distribution between positive and negative samples in real time during training and adaptively adjusts the margin coefficient to minimize the cross-modal semantic gap as far as possible. On this basis, we align multi-level semantic information (i.e., representations from low-, middle-, and high-layer transformer encoders) of cross-modal data by adaptively optimizing the semantic distributions of positive and negative samples. We conduct extensive experiments on two commonly used benchmark datasets, MSCOCO and Flickr30K. The results verify that DAIM outperforms state-of-the-art ITM methods (e.g., a 4.7% RSUM gain on MSCOCO). The open-source code of this project is available at: https://github.com/Hudjkfhdsjfhdjkg/DAIM.git.
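To make the margin optimization idea concrete, the sketch below shows a minimal, hypothetical PyTorch implementation of a triplet ranking loss whose margin tracks the observed gap between positive-pair and hardest-negative similarities in each batch. This is not the authors' released code: the function name adaptive_margin_triplet_loss, the EMA blending with a momentum parameter, and all default values are assumptions introduced here for illustration; the paper's exact formulation may differ.

```python
# Minimal sketch of an adaptive-margin triplet loss for dual-stream ITM.
# Assumption: img_emb and txt_emb are batch-aligned embeddings from the
# image and text encoders, where row i of each tensor is a positive pair.
import torch
import torch.nn.functional as F


def adaptive_margin_triplet_loss(img_emb, txt_emb, ema_margin=0.2, momentum=0.9):
    """Triplet ranking loss with a margin adapted to the current
    positive/negative similarity distribution (hypothetical formulation)."""
    # Cosine similarity matrix; diagonal entries are the positive pairs.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sims = img_emb @ txt_emb.t()                      # (B, B)
    pos = sims.diag()                                 # positive similarities

    # Hardest negatives per image and per text (diagonal masked out).
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    neg_i2t = sims.masked_fill(mask, -1.0).max(dim=1).values
    neg_t2i = sims.masked_fill(mask, -1.0).max(dim=0).values

    # "Perceive the similarity distribution in real time": estimate the
    # current positive/negative gap and blend it into the margin via EMA.
    with torch.no_grad():
        observed_gap = (pos.mean()
                        - 0.5 * (neg_i2t.mean() + neg_t2i.mean())).clamp(min=0.0)
        ema_margin = momentum * ema_margin + (1 - momentum) * observed_gap.item()

    # Standard hinge-based triplet terms in both retrieval directions.
    loss = (F.relu(ema_margin + neg_i2t - pos)
            + F.relu(ema_margin + neg_t2i - pos)).mean()
    return loss, ema_margin
```

In a training loop, the returned ema_margin would be fed back in on the next batch, so the margin follows the evolving positive/negative similarity gap instead of staying fixed as in the conventional triplet paradigm.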