Yangtao Wang, Weibin Huang, Yanzhao Xie, Siyuan Chen, Weilong Peng, Maobin Tang, Meie Fang, Wensheng Zhang
Recently, most image-text matching (ITM) approaches have adopted a dual-stream transformer architecture to learn and align cross-modal semantic information. Although this design narrows the semantic gap between images and texts, it has two main limitations. First, it struggles to discriminate nuanced similarities among features, which can produce misleading matches or even compromise the overall ITM process. Second, the conventional triplet training paradigm relies on a pre-determined, fixed margin coefficient, which prevents it from accurately capturing the similarity relationships between positive and negative samples. In this paper, we propose high feature Distinguishability for Adaptive Image-text Matching with dual-stream transformers (termed DAIM). To address the first limitation, we design a feature discriminability module that pulls similar features closer together, while preserving a degree of distinction between them, and pushes dissimilar features farther apart, yielding the high feature distinguishability required for accurate ITM. To address the second limitation, we devise a margin optimization module that perceives the similarity distribution between positive and negative samples in real time during training and adaptively adjusts the margin coefficient to minimize the cross-modal semantic gap as far as possible. On this basis, we align multi-level semantic information (i.e., representations from low-, middle-, and high-layer transformer encoders) of cross-modal data by adaptively optimizing the semantic distributions of positive and negative samples. We conduct extensive experiments on two commonly used benchmark datasets, MSCOCO and Flickr30K. The results verify that DAIM outperforms state-of-the-art ITM methods (e.g., a 4.7% RSUM gain on MSCOCO). The open-source code of this project is available at: https://github.com/Hudjkfhdsjfhdjkg/DAIM.git.
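To make the margin optimization idea concrete, the sketch below shows a minimal, hypothetical PyTorch implementation of a triplet ranking loss whose margin tracks the observed gap between positive-pair and hardest-negative similarities in each batch. This is not the authors' released code: the function name adaptive_margin_triplet_loss, the EMA blending with a momentum parameter, and all default values are assumptions introduced here for illustration; the paper's exact formulation may differ.

```python
# Minimal sketch of an adaptive-margin triplet loss for dual-stream ITM.
# Assumption: img_emb and txt_emb are batch-aligned embeddings from the
# image and text encoders, where row i of each tensor is a positive pair.
import torch
import torch.nn.functional as F


def adaptive_margin_triplet_loss(img_emb, txt_emb, ema_margin=0.2, momentum=0.9):
    """Triplet ranking loss with a margin adapted to the current
    positive/negative similarity distribution (hypothetical formulation)."""
    # Cosine similarity matrix; diagonal entries are the positive pairs.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sims = img_emb @ txt_emb.t()                      # (B, B)
    pos = sims.diag()                                 # positive similarities

    # Hardest negatives per image and per text (diagonal masked out).
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    neg_i2t = sims.masked_fill(mask, -1.0).max(dim=1).values
    neg_t2i = sims.masked_fill(mask, -1.0).max(dim=0).values

    # "Perceive the similarity distribution in real time": estimate the
    # current positive/negative gap and blend it into the margin via EMA.
    with torch.no_grad():
        observed_gap = (pos.mean()
                        - 0.5 * (neg_i2t.mean() + neg_t2i.mean())).clamp(min=0.0)
        ema_margin = momentum * ema_margin + (1 - momentum) * observed_gap.item()

    # Standard hinge-based triplet terms in both retrieval directions.
    loss = (F.relu(ema_margin + neg_i2t - pos)
            + F.relu(ema_margin + neg_t2i - pos)).mean()
    return loss, ema_margin
```

In a training loop, the returned ema_margin would be fed back in on the next batch, so the margin follows the evolving positive/negative similarity gap instead of staying fixed as in the conventional triplet paradigm.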