JOURNAL ARTICLE

A Self-Supervised Cross-Modal Remote Sensing Foundation Model with Multi-Domain Representation and Cross-Domain Fusion

Abstract

Building a foundation model that extracts generalized features from large volumes of multimodal data is a new challenge in remote sensing. Unlike natural scene images, remote sensing data are acquired by multiple sensors in complex application scenarios, so models tailored to a specific task are difficult to generalize to new scenarios. In this paper, we propose a model architecture based on the concepts of multi-domain representation and cross-domain fusion. By extracting highly generalizable features from massive multimodal data, a single foundation model can perform interpretation across multiple downstream tasks. Experimental results show that the proposed model performs well on multiple downstream tasks, validating the feasibility of a cross-modal remote sensing foundation model for interpretation tasks.
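
The abstract describes the architecture only at a conceptual level. Below is a minimal, hypothetical sketch of how "multi-domain representation" (separate per-modality encoders) and "cross-domain fusion" (cross-attention between modality token streams) could be wired together in PyTorch. All class names, dimensions, and the choice of optical/SAR inputs are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: per-modality encoders ("multi-domain representation")
# followed by bidirectional cross-attention ("cross-domain fusion").
# Names and shapes are assumptions for illustration only.
import torch
import torch.nn as nn


class CrossDomainFusion(nn.Module):
    """Fuse two modality token sequences with bidirectional cross-attention."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_b = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens_a, tokens_b):
        # Each modality attends to the other; residual streams are concatenated.
        fused_a, _ = self.attn_a(tokens_a, tokens_b, tokens_b)
        fused_b, _ = self.attn_b(tokens_b, tokens_a, tokens_a)
        return self.norm(torch.cat([tokens_a + fused_a, tokens_b + fused_b], dim=1))


class ToyCrossModalModel(nn.Module):
    """Separate encoders per modality, then a shared fused feature space."""

    def __init__(self, dim: int = 64, patch: int = 8):
        super().__init__()
        # Patch-embedding convolutions stand in for the modality-specific
        # representation branches (e.g. optical RGB vs. single-channel SAR).
        self.enc_optical = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.enc_sar = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.fusion = CrossDomainFusion(dim)

    def forward(self, optical, sar):
        a = self.enc_optical(optical).flatten(2).transpose(1, 2)  # (B, N, dim)
        b = self.enc_sar(sar).flatten(2).transpose(1, 2)          # (B, N, dim)
        return self.fusion(a, b)  # shared tokens for downstream task heads


if __name__ == "__main__":
    model = ToyCrossModalModel()
    feats = model(torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64))
    print(feats.shape)  # torch.Size([2, 128, 64])
```

In a self-supervised setting such as the one the title suggests, the fused tokens would feed a pretext objective (e.g. masked reconstruction or contrastive alignment across modalities) rather than a supervised head; that choice is outside what the abstract specifies.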

Keywords:
Remote sensing, Sensor fusion, Multi-domain representation, Cross-domain fusion, Generalization, Machine learning, Artificial intelligence, Data mining, Computer science, Engineering

Metrics

Cited by: 8
FWCI (Field-Weighted Citation Impact): 1.74
References: 18
Citation Normalized Percentile: 0.84

Topics

Remote-Sensing Image Classification (Physical Sciences → Engineering → Media Technology)
Advanced Image and Video Retrieval Techniques (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Remote Sensing and Land Use (Physical Sciences → Earth and Planetary Sciences → Atmospheric Science)