Abstract

In traditional remote sensing image captioning models, the attention mechanism plays a dominant role and is used to integrate image features and infer the latent visual-semantic alignment. However, remote sensing scenes are complex and diverse, and using only one attention module to capture features often leads to insufficient semantic representation. In this work, we present a novel Multi-view Attention Network (MAN) model that integrates features from different views. With MAN, more semantically rich ensemble attended features can be obtained from multiple attention modules. Specifically, we enforce diversity among the weights of the attention modules through a cosine distance loss, giving the model distinct views from which to make semantic predictions for each feature. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed model for remote sensing image captioning.
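The diversity constraint described above can be sketched as a pairwise cosine-similarity penalty over the attention modules' weight vectors: the loss is high when modules point in similar directions and low when they are orthogonal. This is a minimal illustrative sketch, not the paper's exact formulation; the function name and averaging scheme are assumptions.

```python
import numpy as np

def diversity_loss(weight_vectors):
    """Average pairwise cosine similarity between attention-module
    weight vectors (flattened). Minimizing this pushes the modules
    toward distinct directions, i.e. distinct 'views'.
    Illustrative sketch only -- not the authors' exact loss."""
    # Normalize each weight vector to unit length.
    normed = [w / np.linalg.norm(w) for w in weight_vectors]
    total, pairs = 0.0, 0
    for i in range(len(normed)):
        for j in range(i + 1, len(normed)):
            # Cosine similarity of unit vectors is their dot product.
            total += float(np.dot(normed[i], normed[j]))
            pairs += 1
    return total / pairs
```

Identical weight vectors yield a loss of 1.0 (no diversity), while mutually orthogonal ones yield 0.0, so gradient descent on this term drives the modules apart.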


Metrics

Cited by: 8
FWCI (Field-Weighted Citation Impact): 0.51
References: 13
Citation Normalized Percentile: 0.66

Topics

- Multimodal Machine Learning Applications (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
- Advanced Image and Video Retrieval Techniques (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
- Domain Adaptation and Few-Shot Learning (Physical Sciences → Computer Science → Artificial Intelligence)