Multi-modal deep learning methods have achieved great improvements in visual grounding, which aims to localize objects specified by text in images. Most existing methods can localize and classify objects with significant appearance differences, but they misclassify extremely similar objects because they exploit multi-modal features inadequately. To address this problem, we propose a novel semantic-aligned cross-modal visual grounding network with transformers (SAC-VGNet). SAC-VGNet integrates visual and textual features with semantic alignment to highlight the feature cues that capture tiny differences between similar objects. Technically, SAC-VGNet incorporates a multi-modal fusion module to effectively fuse visual features and textual descriptions. It also introduces contrastive learning to align linguistic and visual features at the text-to-pixel level, enabling it to capture subtle differences between objects. The overall architecture is end-to-end trainable and requires no extra parameter settings. To evaluate our approach, we manually annotate text descriptions for images in two fine-grained visual grounding datasets. The experimental results demonstrate that SAC-VGNet significantly improves performance in fine-grained visual grounding.
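To make the text-to-pixel alignment concrete, the following is a minimal PyTorch sketch of one way such a contrastive loss can be formulated. The tensor shapes, the temperature hyperparameter, and the binary cross-entropy formulation are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def text_to_pixel_contrastive_loss(pixel_feats, text_feat, gt_mask, temperature=0.07):
    """Pull pixels inside the referred object toward the sentence embedding and
    push background pixels away. Shapes are assumptions for illustration:
      pixel_feats: [B, C, H, W] fused visual features
      text_feat:   [B, C] sentence-level text embedding
      gt_mask:     [B, H, W] binary mask of the referred object
    """
    # L2-normalize both modalities so the dot product is cosine similarity.
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_feat = F.normalize(text_feat, dim=1)
    # Per-pixel similarity to the sentence embedding: [B, H, W].
    sim = torch.einsum('bchw,bc->bhw', pixel_feats, text_feat) / temperature
    # Pixels inside the ground-truth mask are positives, the rest negatives.
    return F.binary_cross_entropy_with_logits(sim, gt_mask.float())

# Example call with random tensors (dimensions chosen arbitrarily):
loss = text_to_pixel_contrastive_loss(
    torch.randn(2, 256, 26, 26),
    torch.randn(2, 256),
    torch.randint(0, 2, (2, 26, 26)))
```

Under this formulation, alignment is supervised densely at every pixel rather than at the region level, which is what allows the loss to penalize confusions between visually similar objects.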