Meng Lan, Fu Rong, Hongzan Jiao, Zhi Gao, Lefei Zhang
Visual grounding for remote sensing images (RSVG) aims to localize the referred object in a remote sensing (RS) image according to a language expression. Existing methods tend to align visual and text features, concatenate them, and then employ a fusion Transformer to learn a token representation for final target localization. However, such a simple fusion Transformer structure fails to sufficiently learn the location representation of the referred object from the multi-modal features. Inspired by the detection Transformer, in this paper we propose a novel language-query-based Transformer framework for RSVG, termed LQVG. Specifically, we adopt the extracted sentence-level text features as queries, called language queries, to retrieve and aggregate representation information of the referred object from the multi-scale visual features in the Transformer decoder. The language queries are then converted into object embeddings for the final coordinate prediction of the referred object. In addition, a multi-scale cross-modal alignment module is devised before the multi-modal Transformer to enhance the semantic correlation between the visual and text features, thus facilitating the cross-modal decoding process and yielding more precise object representations. Moreover, a new RSVG dataset named RSVG-HR is built to evaluate the performance of RSVG approaches on very high-resolution remote sensing images with inconspicuous objects. Experimental results on two benchmark datasets demonstrate that our proposed method significantly surpasses the compared methods and achieves state-of-the-art performance. The dataset and code are available at https://github.com/LANMNG/LQVG.
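To make the core decoding idea concrete, here is a minimal NumPy sketch of one cross-attention step in which a sentence-level text feature serves as the language query over flattened multi-scale visual tokens. This is an illustration of the general mechanism only, not the paper's implementation: all function names, feature dimensions, and the single-head attention are assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def language_query_decode(text_feat, visual_feats):
    """One cross-attention step (hypothetical simplification of the
    LQVG decoder): the sentence-level text feature acts as the query
    and aggregates information from multi-scale visual features.

    text_feat:    (d,) sentence-level language query
    visual_feats: list of (N_i, d) token arrays, one per feature scale
    returns:      (d,) object embedding used for box prediction
    """
    d = text_feat.shape[0]
    q = text_feat[None, :]                      # (1, d) language query
    kv = np.concatenate(visual_feats, axis=0)   # (N, d) flattened multi-scale tokens
    attn = softmax(q @ kv.T / np.sqrt(d))       # (1, N) attention over all tokens
    return (attn @ kv)[0]                       # (d,) aggregated object embedding

# Illustrative usage with random features from two spatial scales.
rng = np.random.default_rng(0)
text = rng.normal(size=(16,))
feats = [rng.normal(size=(64, 16)), rng.normal(size=(16, 16))]
obj_embedding = language_query_decode(text, feats)
```

In the full framework this step would be stacked over several decoder layers, and the resulting object embedding would be fed to a prediction head that regresses the box coordinates of the referred object.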