Cascaded Hierarchical Attention with Adaptive Fusion for Visual Grounding in Remote Sensing

Huming Zhu; Tianqi Gao; Zhixian Li; Zhipeng Chen; Qiuming Li; Kongmiao Miao; Biao Hou; Licheng Jiao

doi:10.3390/rs17172930

JOURNAL ARTICLE

Year: 2025
Journal: Remote Sensing
Vol: 17 (17)
Pages: 2930
Publisher: Multidisciplinary Digital Publishing Institute

Abstract

Visual grounding for remote sensing (RSVG) is the task of localizing the object referred to by a free-form language description in a remote sensing (RS) image. RSVG suffers from low detection accuracy caused by unbalanced multi-scale grounding capability: large objects are grounded markedly more accurately than small ones. Based on Faster R-CNN, we propose Faster R-CNN in Visual Grounding for Remote Sensing (FR-RSVG), a two-stage method for grounding RS objects. Building on this foundation, to enhance the ability to ground multi-scale objects, we propose Faster R-CNN with Adaptive Vision-Language Fusion (FR-AVLF), which introduces a layered Adaptive Vision-Language Fusion (AVLF) module. Specifically, this method adaptively fuses deep or shallow visual features according to the input text (e.g., location-related or object-characteristic descriptions), thereby optimizing semantic feature representation and improving grounding accuracy for objects of different scales. Given that RSVG is essentially an extended form of RS object detection, and to draw on the knowledge the model acquired in prior RS object detection tasks, we propose Faster R-CNN with Adaptive Vision-Language Fusion Pretrained (FR-AVLFPRE). To further enhance model performance, we propose Faster R-CNN with Cascaded Hierarchical Attention Grounding and Multi-Level Adaptive Vision-Language Fusion Pretrained (FR-CHAGAVLFPRE), which introduces a cascaded hierarchical attention grounding mechanism, employs a more advanced language encoder, and extends AVLF to Multi-Level AVLF, significantly improving localization accuracy in complex scenarios. Extensive experiments on the DIOR-RSVG dataset demonstrate that our model surpasses most existing advanced models. To validate the generalization capability of our model, we conducted zero-shot inference experiments on the categories shared between DIOR-RSVG and both the Complex Description DIOR-RSVG (DIOR-RSVG-C) and OPT-RSVG datasets, achieving performance superior to most existing models.
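
The abstract describes AVLF only at a high level: visual features from different depths are fused adaptively according to the input text. The sketch below illustrates that general idea with a simple softmax gate over feature levels. It is not the authors' implementation. The function names, the linear gate, and the assumption that every level has already been pooled to a vector of the same length are all hypothetical choices made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def adaptive_fuse(level_features, text_embedding, gate_weights):
    """Blend per-level visual features with weights predicted from the text.

    level_features: list of L vectors (shallow to deep), each of length C
                    (assumed already pooled and projected to a common size)
    text_embedding: sentence embedding of length D
    gate_weights:   L rows of length D (hypothetical learned gate parameters)
    """
    # One logit per feature level: dot product of its gate row with the text.
    logits = [sum(w * t for w, t in zip(row, text_embedding))
              for row in gate_weights]
    weights = softmax(logits)
    channels = len(level_features[0])
    # Weighted sum across levels, computed channel by channel.
    fused = [
        sum(weights[l] * level_features[l][c]
            for l in range(len(level_features)))
        for c in range(channels)
    ]
    return fused, weights

# Toy example: two levels, and a text embedding that favours the shallow one.
shallow = [1.0, 0.0]
deep = [0.0, 1.0]
gates = [[2.0, 0.0], [0.0, 2.0]]
fused, weights = adaptive_fuse([shallow, deep], [1.0, 0.0], gates)
```

In this toy setup the text embedding aligns with the first gate row, so the shallow level receives the larger weight. A real module would operate on spatial feature maps and learn the gate jointly with the detector.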

Keywords:

Remote sensing; Computer science; Fusion; Environmental science; Geology

Topics

Remote-Sensing Image Classification

Physical Sciences → Engineering → Media Technology

Advanced Image Fusion Techniques

Physical Sciences → Engineering → Media Technology

Remote Sensing and Land Use

Physical Sciences → Earth and Planetary Sciences → Atmospheric Science
