JOURNAL ARTICLE

Visual Rotated Position Encoding Transformer for Remote Sensing Image Captioning

A.K. Liu, Lingwu Meng, Liang Xiao

Year: 2024 Journal: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing Vol: 17 Pages: 20026-20040 Publisher: Institute of Electrical and Electronics Engineers

Abstract

Remote sensing image captioning (RSIC) is a crucial task in interpreting remote sensing images (RSIs), as it involves describing their content in clear and precise natural language. However, RSIC is difficult because of the intricate structure and distinctive features of the images, such as rotational ambiguity. Visually similar objects or areas can lead to misidentification. In addition, prioritizing groups of objects with strong relational ties during captioning poses a significant challenge. To address these challenges, we propose the visual rotated position encoding transformer for RSIC. First, rotation-invariant features and global features are extracted using a multilevel feature extraction (MFE) module. To focus on closely related rotated objects, we design a visual rotated position encoding module, which is incorporated into the transformer encoder to model directional relationships between objects. To distinguish similar features and guide caption generation, we propose a feature enhancement fusion module consisting of a feature enhancement component and a feature fusion component. The feature enhancement component adopts a self-attention mechanism to construct fully connected graphs over object features. The feature fusion component integrates global features and word vectors to guide the caption generation process. In addition, we construct an RSI rotated object detection dataset, RSIC-ROD, and pretrain a rotated object detector. The proposed method demonstrates significant performance improvements on four datasets, showcasing enhanced capabilities in preserving descriptive details, distinguishing similar objects, and accurately capturing object relationships.
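
The abstract states that the feature enhancement component uses self-attention to build fully connected graphs over object features. The following is a minimal sketch of that general idea only, not the authors' implementation: it assumes plain scaled dot-product attention with identity query, key, and value projections (a trained model would normally use learned projections), and the function names and example feature vectors are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    `features` is a list of n object feature vectors (each a list of d floats).
    Returns (enhanced, weights), where weights[i][j] can be read as the edge
    weight from object i to object j in a fully connected graph over objects.
    """
    d = len(features[0])
    scale = math.sqrt(d)
    weights = []
    for q in features:
        # Score object q against every object, itself included.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in features]
        weights.append(softmax(scores))
    # Each enhanced feature is a weighted average of all object features.
    enhanced = [
        [sum(w * v[c] for w, v in zip(row, features)) for c in range(d)]
        for row in weights
    ]
    return enhanced, weights


# Three hypothetical object features; the first two are deliberately similar.
objects = [[1.0, 0.0, 0.5], [0.9, 0.1, 0.5], [0.0, 1.0, 0.2]]
enhanced, weights = self_attention(objects)
```

Because softmax never outputs exactly zero for finite scores, every pair of objects is connected, which is why the attention matrix can be viewed as a fully connected weighted graph. In this toy input the first object attends more strongly to the similar second object than to the dissimilar third one.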

Keywords:
Closed captioning, Computer science, Computer vision, Encoding (memory), Artificial intelligence, Transformer, Position (finance), Hand position, Remote sensing, Image (mathematics), Engineering, Geology, Electrical engineering, Voltage

Metrics

Cited By: 4
FWCI (Field Weighted Citation Impact): 2.12
Refs: 63
Citation Normalized Percentile: 0.81

Topics

Advanced Image and Video Retrieval Techniques
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Advanced Vision and Imaging
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Robotics and Sensor-Based Localization
Physical Sciences → Engineering → Aerospace Engineering
