Huming Zhu, Tianqi Gao, Zhixian Li, Zhipeng Chen, Qiuming Li, Kongmiao Miao, Biao Hou, Licheng Jiao
Visual grounding for remote sensing (RSVG) is the task of localizing a referred object in remote sensing (RS) images by parsing free-form language descriptions. However, RSVG suffers from low detection accuracy caused by unbalanced multi-scale grounding capability: large objects are grounded considerably more accurately than small ones. Based on Faster R-CNN, we propose Faster R-CNN in Visual Grounding for Remote Sensing (FR-RSVG), a two-stage method for grounding RS objects. Building on this foundation, to enhance the ability to ground multi-scale objects, we propose Faster R-CNN with Adaptive Vision-Language Fusion (FR-AVLF), which introduces a layered Adaptive Vision-Language Fusion (AVLF) module. Specifically, this module adaptively fuses deep or shallow visual features according to the input text (e.g., location-related or object-characteristic descriptions), thereby optimizing semantic feature representation and improving grounding accuracy for objects of different scales. Because RSVG is essentially an extended form of RS object detection, and to exploit the knowledge a model acquires in prior RS object detection tasks, we propose Faster R-CNN with Adaptive Vision-Language Fusion Pretrained (FR-AVLFPRE). To further enhance performance, we propose Faster R-CNN with Cascaded Hierarchical Attention Grounding and Multi-Level Adaptive Vision-Language Fusion Pretrained (FR-CHAGAVLFPRE), which introduces a cascaded hierarchical attention grounding mechanism, employs a more advanced language encoder, and extends AVLF to Multi-Level AVLF, significantly improving localization accuracy in complex scenarios. Extensive experiments on the DIOR-RSVG dataset demonstrate that our model surpasses most existing advanced models.
To validate the generalization capability of our model, we conducted zero-shot inference experiments on shared categories between DIOR-RSVG and both Complex Description DIOR-RSVG (DIOR-RSVG-C) and OPT-RSVG datasets, achieving performance superior to most existing models.
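The abstract describes AVLF only at a high level: visual features from shallow or deep layers are fused adaptively depending on the input text. As a rough illustration of that idea, and not the authors' actual module, the sketch below assumes a simple softmax gate: a text embedding is scored against one hypothetical learned vector per feature level, and the resulting weights blend the pooled per-level visual features. The names `adaptive_fuse` and `gate_weights`, the pooled-vector inputs, and the toy numbers are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def adaptive_fuse(text_emb, level_feats, gate_weights):
    """Blend per-level visual features with text-conditioned weights.

    text_emb:     sentence embedding (length d_t)
    level_feats:  K pooled visual feature vectors, shallow -> deep (each length d_v)
    gate_weights: K hypothetical learned vectors (each length d_t), one per level
    Returns the fused feature vector and the per-level weights.
    """
    # One scalar score per feature level, conditioned on the text.
    scores = [dot(w, text_emb) for w in gate_weights]
    alphas = softmax(scores)
    # Convex combination of the per-level features.
    d_v = len(level_feats[0])
    fused = [sum(a * f[i] for a, f in zip(alphas, level_feats)) for i in range(d_v)]
    return fused, alphas


# Toy example: two levels (shallow, deep) and a text embedding that
# happens to align with the shallow-level gate vector.
text_emb = [1.0, 0.0]
gate_weights = [[2.0, 0.0], [0.0, 2.0]]
level_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
fused, alphas = adaptive_fuse(text_emb, level_feats, gate_weights)
```

In this toy setup the shallow level receives the larger weight because the text embedding aligns with its gate vector. A real module would operate on spatial feature maps and learn the gate end to end.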
Zhipeng Zhang, Yang Zou, Ji Wang, Peng Wang
Shabnam Choudhury, Pratham Kurkure, Biplab Banerjee
Ke Li, Di Wang, Haojie Xu, Haodi Zhong, Cong Wang
Yuxi Sun, Shanshan Feng, Xutao Li, Yunming Ye, Jian Kang, Xu Huang
Donghang Yu, Haitao Guo, Qing Xu, Jun Lu, Chuan Zhao, Yuzhun Lin