JOURNAL ARTICLE

CADFormer: Fine-Grained Cross-Modal Alignment and Decoding Transformer for Referring Remote Sensing Image Segmentation

Maofu Liu, Xin Jiang, Xiaokang Zhang

Year: 2025 Journal: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing Vol: 18 Pages: 14557-14569 Publisher: Institute of Electrical and Electronics Engineers

Abstract

Referring remote sensing image segmentation (RRSIS) is a challenging task, aiming to segment specific target objects in remote sensing images based on a given language expression. Existing RRSIS methods typically employ coarse-grained unidirectional alignment approaches to obtain multimodal features, and they often overlook the critical role of language features as contextual information during the decoding process. Consequently, these methods exhibit weak object-level correspondence between visual and language features, leading to incomplete or erroneous predicted masks, especially when handling complex expressions and intricate remote sensing image scenes. To address these challenges, we propose a fine-grained cross-modal alignment and decoding Transformer, CADFormer, for RRSIS. Specifically, we design a semantic mutual guidance alignment module (SMGAM) to achieve both vision-to-language and language-to-vision alignment, enabling comprehensive integration of visual and textual features for fine-grained cross-modal alignment. Furthermore, a textual-enhanced cross-modal decoder (TCMD) is introduced to incorporate language features during decoding, using refined textual information as context to enhance the relationship between cross-modal features. To thoroughly evaluate the performance of CADFormer, especially for inconspicuous targets in complex scenes, we constructed a new RRSIS dataset, called RRSIS-HR, which includes larger high-resolution remote sensing image patches and semantically richer language expressions. Extensive experiments on the RRSIS-HR dataset and the popular RRSIS-D dataset demonstrate the effectiveness and superiority of CADFormer.
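The abstract describes alignment in both directions between visual and language features but gives no implementation detail. The following is only a minimal sketch of one common way to realize bidirectional alignment, namely cross-attention applied in both directions with a residual connection. It is not the authors' SMGAM: the function names (`cross_attend`, `mutual_alignment`), the absence of learned projections, and the toy feature values are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product attention without learned projections (a simplification).

    Each query row is replaced by a softmax-weighted average of the context rows.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * c[j] for w, c in zip(weights, context)) for j in range(dim)]
        )
    return attended


def add_rows(a, b):
    """Element-wise sum of two equally shaped lists of rows (residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mutual_alignment(visual_tokens, word_tokens):
    """Hypothetical bidirectional step: each modality attends over the other."""
    visual_ctx = cross_attend(visual_tokens, word_tokens)  # visual tokens gather word context
    word_ctx = cross_attend(word_tokens, visual_tokens)    # words gather visual context
    return add_rows(visual_tokens, visual_ctx), add_rows(word_tokens, word_ctx)


# Toy inputs (assumed): 3 visual tokens and 2 word tokens, all 4-dimensional.
visual = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
words = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
aligned_visual, aligned_words = mutual_alignment(visual, words)
print(len(aligned_visual), len(aligned_words))
```

In this sketch both outputs keep the shape of their inputs, so the updated word features could in principle be passed on as context to a later decoding stage, in the spirit of what the abstract describes for the TCMD.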

Keywords:
Decoding methods, Computer science, Image segmentation, Transformer, Computer vision, Modal, Segmentation, Artificial intelligence, Voltage, Materials science, Algorithm, Engineering, Electrical engineering

Metrics

Cited By: 3
FWCI (Field Weighted Citation Impact): 14.32
Refs: 48
Citation Normalized Percentile: 0.95

Topics

Advanced Image and Video Retrieval Techniques
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Satellite Image Processing and Photogrammetry
Physical Sciences → Engineering → Ocean Engineering
Medical Image Segmentation Techniques
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation

Sen Lei, Xinyu Xiao, Tianlin Zhang, Heng-Chao Li, Zhenwei Shi, Qing Zhu

Journal: IEEE Transactions on Geoscience and Remote Sensing Year: 2024 Vol: 63 Pages: 1-11

JOURNAL ARTICLE

Area-keywords cross-modal alignment for referring image segmentation

Huiyong Zhang, Lichun Wang, Shuang Li, Kai Xu, Baocai Yin

Journal: Neurocomputing Year: 2024 Vol: 581 Pages: 127475-127475

JOURNAL ARTICLE

RRSIS: Referring Remote Sensing Image Segmentation

Zhenghang Yuan, Lichao Mou, Yuansheng Hua, Xiao Xiang Zhu

Journal: IEEE Transactions on Geoscience and Remote Sensing Year: 2024 Vol: 62 Pages: 1-12

JOURNAL ARTICLE

Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval

Zhiqiang Yuan, Wenkai Zhang, Kun Fu, Xuan Li, Chubo Deng, Hongqi Wang, Xian Sun

Journal: IEEE Transactions on Geoscience and Remote Sensing Year: 2021 Vol: 60 Pages: 1-19