Jianhua Dong, Shengrong Zhao, Liang Hu
Masked Language Modeling (MLM) and Image-Text Matching (ITM) are commonly used in fusion encoders to learn joint representations of images and text. In existing methods, the MLM masking strategy causes image details to be neglected during modeling, while the ITM sampling strategy struggles to consistently select sufficiently hard negative instances, which weakens the constraint it imposes. As a result, fine-grained information is difficult to align in cross-modal retrieval. To address this challenge, this paper proposes a fine-grained information alignment-based visual language model (FAM). On the one hand, an attribute-based masking strategy is employed in MLM, which helps the model focus on the details of objects in images. On the other hand, a robust hard negative sample generation strategy provides challenging negative samples for ITM by altering the relationships between objects. This enables the model to align relationships between objects across modalities and thus calibrates cross-modal retrieval. Extensive experiments demonstrate the effectiveness of the model on cross-modal retrieval tasks.
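The two strategies named in the abstract can be illustrated with a toy sketch. This is not the FAM implementation: the abstract does not specify how attributes are detected or how object relationships are altered, so the attribute vocabulary, the function names `attribute_mask` and `swap_relation_negative`, and the choice of swapping subject and object are all assumptions made purely for illustration.

```python
import random

# Hypothetical attribute vocabulary (an assumption). A real system would more
# likely derive attributes from a parser or an annotated dataset.
ATTRIBUTE_WORDS = {"red", "blue", "green", "small", "large", "wooden", "striped"}
MASK_TOKEN = "[MASK]"


def attribute_mask(tokens, attribute_words=ATTRIBUTE_WORDS, mask_prob=1.0, rng=None):
    """Mask attribute tokens instead of uniformly random tokens.

    Returns the masked token list and a parallel label list that holds the
    original token at masked positions and None elsewhere.
    """
    rng = rng or random.Random(0)
    masked, labels = [], []
    for tok in tokens:
        if tok.lower() in attribute_words and rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


def swap_relation_negative(triple):
    """Build a hard negative from a (subject, relation, object) triple.

    Swapping subject and object keeps the same words but changes the
    relationship between the objects. This is one assumed way to alter
    relationships, not necessarily the one used in the paper.
    """
    subj, rel, obj = triple
    return (obj, rel, subj)


tokens = "a red car beside a small dog".split()
masked_tokens, mask_labels = attribute_mask(tokens)
negative = swap_relation_negative(("dog", "on", "sofa"))
```

With the default `mask_prob=1.0`, every token found in the attribute vocabulary is masked, so recovering it requires looking at the image rather than relying on the surrounding text. The swapped triple shares all of its words with the positive caption, which is what makes it a difficult negative for an ITM-style objective.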
Hui Liu, Xiaoping Chen, Rui Hong, Yan Zhou, Tian-cai Wan, Tai-li Bai
Muntasir Wahed, Xiaona Zhou, Tianjiao Yu, Ismini Lourentzou
Shuhuai Ren, Junyang Lin, Guangxiang Zhao, Rui Men, Yang An, Jingren Zhou, Xu Sun, Hongxia Yang
Siming Ren, Junyang Lin, Guangxiang Zhao, Rui Men, Yang An, Jingren Zhou, Xu Sun, Hongxia Yang