Pan Chen, Xijian Fan, Tardi Tjahjadi, Haiyan Guan, Liyong Fu, Qiaolin Ye, Ruili Wang
With the rapid development of Earth observation sensors, the fusion of remote sensing (RS) data for multimodal semantic segmentation has garnered significant research attention in recent years. Fusing multimodal data is challenging because discrepancies in image acquisition mechanisms among different sensors lead to misalignment issues. To mitigate this challenge, this article presents VSGNet, a novel multimodal fusion framework designed for RS semantic segmentation. The work aims to utilize vision structure guidance derived from a vision foundation model for accurate segmentation without the need for auxiliary sensors. Specifically, the framework incorporates a cross-modal collaborative network for feature embedding that blends a convolutional neural network and a vision transformer to simultaneously capture both local information and long-range dependencies from the input modalities. Subsequently, a multiscale cross-modal feature fusion module comprising fusion enhancement and feature recalibration components is proposed to emphasize the adaptive multiscale interaction of diverse complementary cues between the modalities while suppressing the impact of noise and uncertainties present in RS data. Extensive experiments conducted on four diverse RS datasets, i.e., ISPRS Potsdam, ISPRS Vaihingen, LoveDA, and tree mapping, demonstrate that VSGNet outperforms state-of-the-art RS semantic segmentation models.
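The abstract only names the fusion components; as a rough illustration of the general idea, the following PyTorch sketch shows one way a block of this kind could be structured: a convolutional branch for local detail, a transformer branch for long-range context, and a channel-attention step that recalibrates the fused features. The class name, layer choices, and dimensions are assumptions for illustration only, not the authors' VSGNet implementation.

```python
# Hypothetical sketch of the fusion idea described in the abstract: a CNN
# branch captures local information, a transformer branch captures long-range
# dependencies, and a squeeze-and-excitation-style recalibration down-weights
# noisy channels. Names and dimensions are illustrative, not from VSGNet.
import torch
import torch.nn as nn


class CrossModalFusionBlock(nn.Module):
    def __init__(self, channels: int = 64, num_heads: int = 4):
        super().__init__()
        # Local branch: plain convolution over the concatenated modalities.
        self.local = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Global branch: one transformer encoder layer over flattened tokens.
        self.global_ = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True
        )
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)
        # Recalibration: channel attention to suppress noisy or misaligned cues.
        self.recalibrate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # x_a, x_b: per-modality feature maps of shape (B, C, H, W).
        fused = torch.cat([x_a, x_b], dim=1)
        local = self.local(fused)                      # (B, C, H, W)
        b, c, h, w = local.shape
        tokens = local.flatten(2).transpose(1, 2)      # (B, H*W, C)
        global_ = self.global_(tokens).transpose(1, 2).reshape(b, c, h, w)
        out = self.proj(torch.cat([local, global_], dim=1))
        return out * self.recalibrate(out)             # channel-wise reweighting


if __name__ == "__main__":
    block = CrossModalFusionBlock(channels=64)
    rgb = torch.randn(2, 64, 32, 32)   # e.g., optical-image features
    dsm = torch.randn(2, 64, 32, 32)   # e.g., elevation-map features
    print(block(rgb, dsm).shape)       # torch.Size([2, 64, 32, 32])
```

Applying this block at several feature scales, as the abstract's multiscale fusion suggests, would amount to instantiating one such module per encoder stage.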