Multi-modal deep learning methods have achieved great improvements in visual grounding, which aims to localize objects specified by text in images. Most existing methods can localize and classify objects with significant appearance differences, but they misclassify extremely similar objects because they exploit multi-modal features inadequately. To address this problem, we propose a novel semantic-aligned cross-modal visual grounding network with transformers (SAC-VGNet). SAC-VGNet integrates visual and textual features with semantic alignment to highlight the feature cues that capture tiny differences between similar objects. Technically, SAC-VGNet incorporates a multi-modal fusion module to effectively fuse visual features and textual descriptions. It also introduces contrastive learning to align linguistic and visual features at the text-to-pixel level, enabling it to capture subtle differences between objects. The overall architecture is end-to-end trainable and requires no extra parameter settings. To evaluate our approach, we manually annotate text descriptions for images in two fine-grained visual grounding datasets. The experimental results demonstrate that SAC-VGNet significantly improves performance in fine-grained visual grounding.
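To make the text-to-pixel alignment concrete, the following is a minimal PyTorch sketch of one way such a contrastive loss can be formulated. The tensor shapes, the temperature hyperparameter, and the binary cross-entropy formulation are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def text_to_pixel_contrastive_loss(pixel_feats, text_feat, gt_mask, temperature=0.07):
    """Pull pixels inside the referred object toward the sentence embedding and
    push background pixels away. Shapes are assumptions for illustration:
      pixel_feats: [B, C, H, W] fused visual features
      text_feat:   [B, C] sentence-level text embedding
      gt_mask:     [B, H, W] binary mask of the referred object
    """
    # L2-normalize both modalities so the dot product is cosine similarity.
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_feat = F.normalize(text_feat, dim=1)
    # Per-pixel similarity to the sentence embedding: [B, H, W].
    sim = torch.einsum('bchw,bc->bhw', pixel_feats, text_feat) / temperature
    # Pixels inside the ground-truth mask are positives, the rest negatives.
    return F.binary_cross_entropy_with_logits(sim, gt_mask.float())

# Example call with random tensors (dimensions chosen arbitrarily):
loss = text_to_pixel_contrastive_loss(
    torch.randn(2, 256, 26, 26),
    torch.randn(2, 256),
    torch.randint(0, 2, (2, 26, 26)))
```

Under this formulation, alignment is supervised densely at every pixel rather than at the region level, which is what allows the loss to penalize confusions between visually similar objects.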