Visual Rotated Position Encoding Transformer for Remote Sensing Image Captioning

A.K. Liu; Lingwu Meng; Liang Xiao

doi:10.1109/jstars.2024.3487846

JOURNAL ARTICLE

Year: 2024
Journal: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Vol: 17
Pages: 20026-20040
Publisher: Institute of Electrical and Electronics Engineers

Abstract

Remote sensing image captioning (RSIC) is a crucial task in interpreting remote sensing images (RSIs): it describes their content in clear and precise natural language. However, RSIC is difficult because of the intricate structure and distinctive features of these images, such as rotational ambiguity. Visually similar objects or areas can be misidentified. In addition, prioritizing groups of objects with strong relational ties during captioning poses a significant challenge. To address these challenges, we propose the visual rotated position encoding transformer for RSIC. First, rotation-invariant features and global features are extracted using a multilevel feature extraction (MFE) module. To focus on closely related rotated objects, we design a visual rotated position encoding module, which is incorporated into the transformer encoder to model directional relationships between objects. To distinguish similar features and guide caption generation, we propose a feature enhancement fusion module consisting of a feature enhancement component and a feature fusion component. The feature enhancement component uses a self-attention mechanism to construct fully connected graphs over object features. The feature fusion component integrates global features and word vectors to guide caption generation. In addition, we construct RSIC-ROD, a rotated object detection dataset for RSIs, and pretrain a rotated object detector on it. The proposed method demonstrates significant performance improvements on four datasets, with enhanced ability to preserve descriptive details, distinguish similar objects, and accurately capture object relationships.
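The abstract does not give the formulation of the position encoding or of the feature enhancement component, so the following is only a generic sketch of the two ideas it names: self-attention over object features treated as a fully connected graph, with an additive bias derived from the relative direction and orientation of rotated boxes. The box layout `(cx, cy, w, h, theta)`, the function names, and the weight `w_g` are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relative_geometry(boxes):
    """Pairwise geometry for rotated boxes (hypothetical encoding).

    boxes: (N, 5) array of (cx, cy, w, h, theta), theta in radians.
    Returns an (N, N, 4) array holding, for each pair (i, j), the cos/sin
    of the direction from centre i to centre j and the cos/sin of the
    orientation difference theta_j - theta_i.
    """
    cx, cy, theta = boxes[:, 0], boxes[:, 1], boxes[:, 4]
    dx = cx[None, :] - cx[:, None]
    dy = cy[None, :] - cy[:, None]
    direction = np.arctan2(dy, dx)
    dtheta = theta[None, :] - theta[:, None]
    return np.stack(
        [np.cos(direction), np.sin(direction), np.cos(dtheta), np.sin(dtheta)],
        axis=-1,
    )

def attention_with_geometry(feats, boxes, w_q, w_k, w_v, w_g):
    """Single-head self-attention where every object attends to every other
    object (a fully connected graph), with an assumed additive geometric bias."""
    q, k, v = feats @ w_q, feats @ w_k, feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (N, N) content similarity
    bias = relative_geometry(boxes) @ w_g     # (N, N) geometry term
    weights = softmax(scores + bias, axis=-1) # edge weights of the graph
    return weights @ v, weights

# Usage with random stand-in features for three detected objects.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 8))
boxes = np.array([
    [0.0, 0.0, 2.0, 1.0, 0.0],
    [4.0, 0.0, 2.0, 1.0, np.pi / 2],
    [0.0, 3.0, 1.0, 1.0, 0.0],
])
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
w_g = rng.normal(size=4)
out, weights = attention_with_geometry(feats, boxes, w_q, w_k, w_v, w_g)
```

Each row of `weights` sums to one and every entry is positive, which is what makes the attention pattern a fully connected graph over objects. Encoding angles through their cosine and sine avoids the discontinuity at ±π; whether the paper makes the same choice is not stated in the abstract.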

Keywords:

Closed captioning; Computer science; Computer vision; Encoding (memory); Artificial intelligence; Transformer; Position (finance); Hand position; Remote sensing; Image (mathematics); Engineering; Geology; Electrical engineering; Voltage

Metrics

FWCI (Field Weighted Citation Impact): 2.12

Citation Normalized Percentile: 0.81

Topics

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Vision and Imaging

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Robotics and Sensor-Based Localization

Physical Sciences → Engineering → Aerospace Engineering

Related Documents

Remote Sensing Image Captioning Using Transformer

Region-guided transformer for remote sensing image captioning

Cooperative Connection Transformer for Remote Sensing Image Captioning

Remote-Sensing Image Captioning Based on Multilayer Aggregated Transformer

Prior Knowledge-Guided Transformer for Remote Sensing Image Captioning