JOURNAL ARTICLE

Multi-Source Interactive Stair Attention for Remote Sensing Image Captioning

Xiangrong Zhang, Yunpeng Li, Xin Wang, Feixiang Liu, Zhaoji Wu, Xina Cheng, Licheng Jiao

Year: 2023   Journal: Remote Sensing   Vol: 15 (3)   Pages: 579-579   Publisher: Multidisciplinary Digital Publishing Institute

Abstract

The aim of remote sensing image captioning (RSIC) is to describe a given remote sensing image (RSI) using coherent sentences. Most existing attention-based methods model the coherence through an LSTM-based decoder, which dynamically infers a word vector from preceding sentences. However, these methods are indirectly guided through the confusion of attentive regions, as (1) the weighted average in the attention mechanism distracts the word vector from capturing pertinent visual regions and (2) there are few constraints or rewards for learning long-range transitions. In this paper, we propose a multi-source interactive stair attention mechanism that separately models the semantics of preceding sentences and visual regions of interest. Specifically, the multi-source interaction takes previous semantic vectors as queries and applies an attention mechanism on regional features to acquire the next word vector, which reduces immediate hesitation by considering linguistics. The stair attention divides the attentive weights into three levels—that is, the core region, the surrounding region, and other regions—and all regions in the search scope are focused on differently. Then, a CIDEr-based reward reinforcement learning is devised, in order to enhance the quality of the generated sentences. Comprehensive experiments on widely used benchmarks (i.e., the Sydney-Captions, UCM-Captions, and RSICD data sets) demonstrate the superiority of the proposed model over state-of-the-art models, in terms of its coherence, while maintaining high accuracy.
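The stair attention described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulation, so everything below is an assumption made for illustration: regions are taken to lie on a 2D grid, the core region is taken to be the highest-scoring region, the surrounding region is taken to be its 8-neighborhood, and the per-level scale factors in `level_scale` are hypothetical values. The function name `stair_attention` and its parameters are likewise invented here and are not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def stair_attention(query, regions, grid_w, level_scale=(1.0, 0.5, 0.1)):
    """Return (weights, context) for one decoding step.

    query       -- previous semantic vector, used as the attention query
    regions     -- regional feature vectors in row-major grid order
    grid_w      -- grid width (assumption: regions lie on a 2D grid)
    level_scale -- hypothetical factors for core / surrounding / other regions
    """
    d = len(query)
    # Scaled dot-product score between the query and each region.
    scores = [
        sum(q * r for q, r in zip(query, reg)) / math.sqrt(d)
        for reg in regions
    ]
    base = softmax(scores)

    # Core region (assumption): the region with the highest base weight.
    core = max(range(len(base)), key=base.__getitem__)
    core_row, core_col = divmod(core, grid_w)

    staired = []
    for i, w in enumerate(base):
        row, col = divmod(i, grid_w)
        if i == core:
            level = 0  # core region
        elif abs(row - core_row) <= 1 and abs(col - core_col) <= 1:
            level = 1  # surrounding region (assumed 8-neighborhood)
        else:
            level = 2  # all other regions
        staired.append(w * level_scale[level])

    # Renormalize so the three-level weights sum to 1.
    total = sum(staired)
    weights = [w / total for w in staired]

    # Context vector: weighted sum of regional features.
    dim = len(regions[0])
    context = [
        sum(w * reg[k] for w, reg in zip(weights, regions))
        for k in range(dim)
    ]
    return weights, context
```

Under these assumptions, two regions with the same base attention weight end up with different final weights when one borders the core region and the other does not, which is the "focused on differently" behavior the abstract describes.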

Keywords:
Closed captioning; Computer science; Coherence (philosophical gambling strategy); Word (group theory); Artificial intelligence; Semantics (computer science); Natural language processing; Scope (computer science); Mechanism (biology); Image (mathematics); Linguistics

Metrics

Cited By: 24
FWCI (Field Weighted Citation Impact): 4.37
Refs: 53
Citation Normalized Percentile: 0.93

Topics

Multimodal Machine Learning Applications
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences → Computer Science → Artificial Intelligence