JOURNAL ARTICLE

Semantic-Aligned Cross-Modal Visual Grounding Network with Transformers

Qianjun Zhang, Jin Yuan

Year: 2023   Journal: Applied Sciences   Vol: 13 (9)   Article: 5649   Publisher: Multidisciplinary Digital Publishing Institute

Abstract

Multi-modal deep learning methods have achieved great improvements in visual grounding, whose objective is to localize text-specified objects in images. Most existing methods can localize and classify objects with significant appearance differences but suffer from misclassification for extremely similar objects, due to inadequate exploration of multi-modal features. To address this problem, we propose a novel semantic-aligned cross-modal visual grounding network with transformers (SAC-VGNet). SAC-VGNet integrates visual and textual features with semantic alignment to highlight important feature cues for capturing tiny differences between similar objects. Technically, SAC-VGNet incorporates a multi-modal fusion module to effectively fuse visual and textual descriptions. It also introduces contrastive learning to align linguistic and visual features at the text-to-pixel level, enabling the capture of subtle differences between objects. The overall architecture is end-to-end without the need for extra parameter settings. To evaluate our approach, we manually annotate text descriptions for images in two fine-grained visual grounding datasets. The experimental results demonstrate that SAC-VGNet significantly improves performance in fine-grained visual grounding.
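The text-to-pixel contrastive alignment described in the abstract can be sketched roughly as follows. This is a hypothetical illustration, not the paper's exact formulation: it assumes a sigmoid/BCE-style contrastive objective, a sentence-level text embedding, per-pixel visual embeddings, and a binary mask of the referred object; all function and variable names here are illustrative.

```python
import numpy as np

def text_to_pixel_contrastive_loss(text_emb, pixel_embs, mask, temperature=0.07):
    """Hypothetical sketch of text-to-pixel contrastive alignment.

    text_emb:   (D,)      sentence-level text embedding
    pixel_embs: (H, W, D) per-pixel visual embeddings
    mask:       (H, W)    1 for pixels of the referred object, 0 elsewhere
    """
    # L2-normalize both sides so the dot product is a cosine similarity
    t = text_emb / np.linalg.norm(text_emb)
    p = pixel_embs / np.linalg.norm(pixel_embs, axis=-1, keepdims=True)

    # Similarity map between the text and every pixel, scaled by a temperature
    sim = (p @ t) / temperature            # shape (H, W)

    # Sigmoid + binary cross-entropy: pull object pixels toward the text
    # embedding and push background pixels away from it
    prob = 1.0 / (1.0 + np.exp(-sim))
    eps = 1e-8
    loss = -(mask * np.log(prob + eps) + (1 - mask) * np.log(1 - prob + eps))
    return loss.mean()
```

When every pixel embedding inside the mask points in the same direction as the text embedding, the loss approaches zero; misaligned or background pixels that resemble the text raise it, which is the mechanism that would separate visually similar objects.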

Keywords:
Computer science, Multi-modal learning, Artificial intelligence, Transformer, Natural language processing, Visual reasoning, Computer vision, Engineering

Metrics

Cited By: 2
FWCI (Field-Weighted Citation Impact): 0.36
Refs: 53
Citation Normalized Percentile: 0.52

Topics

Multimodal Machine Learning Applications (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Advanced Image and Video Retrieval Techniques (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Domain Adaptation and Few-Shot Learning (Physical Sciences → Computer Science → Artificial Intelligence)

Related Documents

JOURNAL ARTICLE

Hierarchical cross-modal contextual attention network for visual grounding

Xin Xu, Gang Lv, Yining Sun, HU Yu-xia, Fudong Nian

Journal: Multimedia Systems   Year: 2023   Vol: 29 (4)   Pages: 2073-2083
JOURNAL ARTICLE

Visual Grounding with Transformers

Ye Du, Zehua Fu, Qingjie Liu, Yunhong Wang

Conference: 2022 IEEE International Conference on Multimedia and Expo (ICME)   Year: 2022   Pages: 1-6
JOURNAL ARTICLE

Cross-modal event extraction via Visual Event Grounding and Semantic Relation Filling

Maofu Liu, Bingying Zhou, Huijun Hu, Chen Qiu, Xiaokang Zhang

Journal: Information Processing &amp; Management   Year: 2024   Vol: 62 (3)   Article: 104027
JOURNAL ARTICLE

Learning Cross-Modal Context Graph for Visual Grounding

Yongfei Liu, Bo Wan, Xiaodan Zhu, Xuming He

Journal: Proceedings of the AAAI Conference on Artificial Intelligence   Year: 2020   Vol: 34 (07)   Pages: 11645-11652