RSCaMa: Remote Sensing Image Change Captioning With State Space Model

Chenyang Liu; Keyan Chen; Bowen Chen; Haotian Zhang; Zhengxia Zou; Zhenwei Shi

doi:10.1109/lgrs.2024.3404604

JOURNAL ARTICLE

Year: 2024 Journal: IEEE Geoscience and Remote Sensing Letters Vol: 21 Pages: 1-5 Publisher: Institute of Electrical and Electronics Engineers

Abstract

Remote Sensing Image Change Captioning (RSICC) aims to describe, in natural language, the surface changes between multi-temporal remote sensing images, including the categories and locations of the changed objects and their dynamics (e.g., added or disappeared). This task poses challenges for the spatial and temporal modeling of bi-temporal features. Although previous methods have made progress in spatial change perception, they remain weak in joint spatial-temporal modeling. To address this, we propose RSCaMa, a novel model that achieves efficient joint spatial-temporal modeling through multiple CaMa layers, enabling iterative refinement of bi-temporal features. For efficient spatial modeling, we introduce the recently popular Mamba (a state space model), which has a global receptive field and linear complexity, into the RSICC task and propose the Spatial Difference-aware SSM (SD-SSM), overcoming the limitations of previous CNN- and Transformer-based methods in receptive field and computational complexity. SD-SSM enhances the model's ability to sharply capture spatial changes. For efficient temporal modeling, considering the potential correlation between the temporal scanning characteristics of Mamba and the temporal nature of RSICC, we propose the Temporal-Traversing SSM (TT-SSM), which scans bi-temporal features in a temporal cross-wise manner, enhancing the model's temporal understanding and information interaction. Experiments validate the effectiveness of the efficient joint spatial-temporal modeling and demonstrate the outstanding performance of RSCaMa and the potential of Mamba in the RSICC task. Additionally, we systematically compare three language decoders (Mamba, a GPT-style decoder, and a Transformer decoder), providing valuable insights for future RSICC research. The code will be available at https://github.com/Chen-Yang-Liu/RSCaMa.
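The abstract does not say exactly how TT-SSM orders tokens, so the following is only a toy sketch of one plausible reading of "temporal cross-wise" scanning: tokens from the two acquisition dates are interleaved position by position and then passed through a single sequential scan. The function names (`interleave_tokens`, `linear_scan`, `split_interleaved`), the scalar tokens, and the fixed `decay` recurrence are all hypothetical illustrations. The recurrence is a plain linear state update standing in for a state space model, not Mamba's selective scan and not the authors' implementation.

```python
def interleave_tokens(feats_t1, feats_t2):
    """Interleave two equal-length token sequences as t1[0], t2[0], t1[1], t2[1], ...

    Assumption: this position-wise alternation is one possible reading of
    "temporal cross-wise" scanning; the paper may order tokens differently.
    """
    if len(feats_t1) != len(feats_t2):
        raise ValueError("bi-temporal sequences must have the same length")
    merged = []
    for a, b in zip(feats_t1, feats_t2):
        merged.append(a)
        merged.append(b)
    return merged


def linear_scan(tokens, decay=0.5):
    """Toy linear recurrence h_k = decay * h_{k-1} + x_k over scalar tokens.

    A stand-in for a state space scan: each output depends on all earlier
    tokens, and the cost grows linearly with sequence length.
    """
    h = 0.0
    states = []
    for x in tokens:
        h = decay * h + x
        states.append(h)
    return states


def split_interleaved(states):
    """Undo the interleaving: even indices belong to t1, odd indices to t2."""
    return states[0::2], states[1::2]


# Two tokens per date; real features would be vectors, scalars keep this readable.
t1 = [1.0, 2.0]
t2 = [10.0, 20.0]
mixed = interleave_tokens(t1, t2)           # [1.0, 10.0, 2.0, 20.0]
out_t1, out_t2 = split_interleaved(linear_scan(mixed))
print(out_t1, out_t2)                       # [1.0, 7.25] [10.5, 23.625]
```

In this toy setting the second output for the first date (7.25) already carries information from the second date's first token, which is the kind of cross-temporal interaction the abstract attributes to TT-SSM.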

Keywords:

Closed captioning; Computer science; Remote sensing; Image (mathematics); Space (punctuation); State (computer science); Computer vision; Artificial intelligence; Geology; Algorithm

Metrics

FWCI (Field Weighted Citation Impact): 37.64

Citation Normalized Percentile: 1.00

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Video Analysis and Summarization

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
