Multi-Source Interactive Stair Attention for Remote Sensing Image Captioning

Xiangrong Zhang; Yunpeng Li; Xin Wang; Feixiang Liu; Zhaoji Wu; Xina Cheng; Licheng Jiao

doi:10.3390/rs15030579

JOURNAL ARTICLE

Year: 2023; Journal: Remote Sensing; Vol: 15 (3); Pages: 579-579; Publisher: Multidisciplinary Digital Publishing Institute

Abstract

The aim of remote sensing image captioning (RSIC) is to describe a given remote sensing image (RSI) using coherent sentences. Most existing attention-based methods model the coherence through an LSTM-based decoder, which dynamically infers a word vector from preceding sentences. However, these methods are indirectly guided through the confusion of attentive regions, as (1) the weighted average in the attention mechanism distracts the word vector from capturing pertinent visual regions and (2) there are few constraints or rewards for learning long-range transitions. In this paper, we propose a multi-source interactive stair attention mechanism that separately models the semantics of preceding sentences and visual regions of interest. Specifically, the multi-source interaction takes previous semantic vectors as queries and applies an attention mechanism on regional features to acquire the next word vector, which reduces immediate hesitation by considering linguistics. The stair attention divides the attentive weights into three levels—that is, the core region, the surrounding region, and other regions—and all regions in the search scope are focused on differently. Then, a CIDEr-based reward reinforcement learning is devised, in order to enhance the quality of the generated sentences. Comprehensive experiments on widely used benchmarks (i.e., the Sydney-Captions, UCM-Captions, and RSICD data sets) demonstrate the superiority of the proposed model over state-of-the-art models, in terms of its coherence, while maintaining high accuracy.
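
The abstract gives no formulas, so the following is only an illustrative sketch of the idea it describes: a previous semantic vector acts as the query over regional features, and the resulting attention weights are rescaled by three levels (core region, surrounding region, other regions). The function name `stair_attention`, the dot-product scoring, the grid layout, the use of Chebyshev distance to define the surrounding region, and the `level_scale` values are all assumptions made for illustration, not the authors' method.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def stair_attention(query, regions, grid_w, level_scale=(1.0, 0.5, 0.1)):
    """Hypothetical three-level ("stair") attention over a row-major grid.

    query       -- previous semantic vector (list of floats)
    regions     -- regional feature vectors, one per grid cell
    grid_w      -- grid width, used to recover (row, col) positions
    level_scale -- assumed scale for core, surrounding, and other regions
    """
    # Score each region against the query (dot product is an assumption).
    scores = [sum(q * r for q, r in zip(query, region)) for region in regions]
    weights = softmax(scores)

    # Core region: the single most attended cell (assumption).
    core = max(range(len(weights)), key=weights.__getitem__)
    core_row, core_col = divmod(core, grid_w)

    staired = []
    for index, weight in enumerate(weights):
        row, col = divmod(index, grid_w)
        # Chebyshev distance 1 marks the surrounding region (assumption).
        dist = max(abs(row - core_row), abs(col - core_col))
        level = 0 if dist == 0 else 1 if dist == 1 else 2
        staired.append(weight * level_scale[level])

    # Renormalise so the rescaled weights still sum to 1.
    total = sum(staired)
    staired = [w / total for w in staired]

    # Context vector: weighted sum of regional features.
    dim = len(regions[0])
    context = [
        sum(w * region[d] for w, region in zip(staired, regions))
        for d in range(dim)
    ]
    return staired, context, core


# Toy 3x3 grid of 2-d features; the query favours the first feature.
toy_regions = [[float(i), 1.0] for i in range(9)]
toy_weights, toy_context, toy_core = stair_attention([1.0, 0.0], toy_regions, grid_w=3)
```

In this toy setup the core cell ends up with a larger share of the attention than plain softmax would give it, because the surrounding and other regions are scaled down before renormalisation. How the paper actually sets the level boundaries and weights cannot be recovered from the abstract.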

Keywords:

Closed captioning; Computer science; Coherence (philosophical gambling strategy); Word (group theory); Artificial intelligence; Semantics (computer science); Natural language processing; Scope (computer science); Mechanism (biology); Image (mathematics); Linguistics

Metrics

FWCI (Field Weighted Citation Impact): 4.37

Citation Normalized Percentile: 0.93

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Multi-View Attention Network for Remote Sensing Image Captioning

Remote Sensing Image Captioning With Multi-Scale Feature and Small Target Attention

Multi-Level Feature And Dual-Keys Attention For Remote Sensing Image Captioning

Bootstrapping Interactive Image–Text Alignment for Remote Sensing Image Captioning

Exploring Multi-Level Attention and Semantic Relationship for Remote Sensing Image Captioning