Maofu Liu, Xin Jiang, Xiaokang Zhang
Referring remote sensing image segmentation (RRSIS) is a challenging task that aims to segment specific target objects in remote sensing images based on a given language expression. Existing RRSIS methods typically employ coarse-grained unidirectional alignment approaches to obtain multimodal features, and they often overlook the critical role of language features as contextual information during the decoding process. Consequently, these methods exhibit weak object-level correspondence between visual and language features, leading to incomplete or erroneous predicted masks, especially when handling complex expressions and intricate remote sensing scenes. To address these challenges, we propose CADFormer, a fine-grained cross-modal alignment and decoding Transformer for RRSIS. Specifically, we design a semantic mutual guidance alignment module (SMGAM) to achieve both vision-to-language and language-to-vision alignment, enabling comprehensive integration of visual and textual features for fine-grained cross-modal alignment. Furthermore, a textual-enhanced cross-modal decoder (TCMD) is introduced to incorporate language features during decoding, using refined textual information as context to strengthen the relationship between cross-modal features. To thoroughly evaluate the performance of CADFormer, especially on inconspicuous targets in complex scenes, we construct a new RRSIS dataset, RRSIS-HR, which includes larger high-resolution remote sensing image patches and semantically richer language expressions. Extensive experiments on the RRSIS-HR dataset and the popular RRSIS-D dataset demonstrate the effectiveness and superiority of CADFormer.
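The abstract names two architectural ideas: bidirectional (mutual-guidance) cross-modal alignment and a decoder that keeps refined text as context. The sketch below illustrates both with generic cross-attention layers; the class names, tensor shapes, and use of nn.MultiheadAttention are illustrative assumptions, not the paper's actual SMGAM/TCMD implementations.

```python
# Minimal sketch of mutual-guidance alignment and text-conditioned decoding,
# under the assumptions stated above. Shapes and module designs are hypothetical.
import torch
import torch.nn as nn


class MutualGuidanceAlignment(nn.Module):
    """Vision-to-language and language-to-vision cross-attention (SMGAM-like)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.v2l = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.l2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_l = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, lang: torch.Tensor):
        # vis:  (B, N_pixels, C) flattened visual features
        # lang: (B, N_tokens, C) language token features
        lang_ref, _ = self.v2l(lang, vis, vis)  # language tokens attend to vision
        vis_ref, _ = self.l2v(vis, lang, lang)  # visual features attend to language
        return self.norm_v(vis + vis_ref), self.norm_l(lang + lang_ref)


class TextEnhancedDecoderLayer(nn.Module):
    """Decoder step that injects refined text as context (TCMD-like)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, queries, vis, lang):
        q, _ = self.cross_txt(queries, lang, lang)  # textual context first
        queries = self.norm1(queries + q)
        q, _ = self.cross_vis(queries, vis, vis)    # then gather visual evidence
        return self.norm2(queries + q)


if __name__ == "__main__":
    B, C = 2, 256
    vis = torch.randn(B, 32 * 32, C)  # e.g. a 32x32 feature map, flattened
    lang = torch.randn(B, 20, C)      # e.g. 20 word tokens
    queries = torch.randn(B, 1, C)    # a single mask query

    vis, lang = MutualGuidanceAlignment(C)(vis, lang)
    out = TextEnhancedDecoderLayer(C)(queries, vis, lang)
    print(out.shape)  # torch.Size([2, 1, 256])
```

The key design point mirrored here is the ordering: the decoder attends to the refined language features before the visual ones, so textual context conditions which visual evidence each mask query collects.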