JOURNAL ARTICLE

A Multilevel Multimodal Fusion Transformer for Remote Sensing Semantic Segmentation

Xianping Ma, Xiaokang Zhang, Man-On Pun, Ming Liu

Year: 2024  Journal: IEEE Transactions on Geoscience and Remote Sensing  Vol: 62  Pages: 1-15  Publisher: Institute of Electrical and Electronics Engineers

Abstract

Accurate semantic segmentation of remote sensing data plays a crucial role in the success of geoscience research and applications. Recently, multimodal fusion-based segmentation models have attracted much attention due to their outstanding performance compared with conventional single-modal techniques. However, most of these models perform their fusion operation using convolutional neural networks (CNNs) or the vision transformer (ViT), resulting in insufficient local-global contextual modeling and representative capabilities. In this work, a multilevel multimodal fusion scheme called FTransUNet is proposed to provide a robust and effective multimodal fusion backbone for semantic segmentation by integrating both CNN and ViT into one unified fusion framework. First, shallow-level features are extracted and fused through convolutional layers and shallow-level feature fusion (SFF) modules. After that, deep-level features characterizing semantic information and spatial relationships are extracted and fused by a well-designed Fusion ViT (FViT). It applies Adaptively Mutually Boosted Attention (Ada-MBA) layers and Self-Attention (SA) layers alternately in a three-stage scheme to learn cross-modality representations of high inter-class separability and low intra-class variations. Specifically, the proposed Ada-MBA computes SA and Cross-Attention (CA) in parallel to enhance intra- and cross-modality contextual information simultaneously while steering the attention distribution toward semantic-aware regions. As a result, FTransUNet can fuse shallow-level and deep-level features in a multilevel manner, taking full advantage of CNN and transformer to accurately characterize local details and global semantics, respectively. Extensive experiments confirm the superior performance of the proposed FTransUNet compared with other multimodal fusion approaches on two fine-resolution remote sensing datasets, namely ISPRS Vaihingen and Potsdam.
The source code in this work is available at https://github.com/sstary/SSRS.
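To make the abstract's description of Ada-MBA concrete, the following is a minimal single-head sketch of the idea of computing self-attention and cross-attention in parallel over two modalities and blending them. It is an illustration only, not the authors' implementation: the learned Q/K/V projections, multi-head structure, and the adaptive (learned) blending are omitted, and the scalar `alpha` stands in as a hypothetical fixed mixing weight. The actual FTransUNet code is in the linked repository.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d)) V.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def ada_mba_sketch(x, y, alpha=0.5):
    """Toy version of the parallel SA/CA idea in Ada-MBA.

    x, y: token matrices (tokens x dim) from two modalities.
    Self-attention captures intra-modality context; cross-attention
    (queries from x, keys/values from y) captures cross-modality
    context. Here they are blended with a fixed weight; the real
    layer learns projections and adapts the fusion.
    """
    sa = attention(x, x, x)   # intra-modality context
    ca = attention(x, y, y)   # cross-modality context
    return alpha * sa + (1 - alpha) * ca

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # e.g., optical-image tokens
y = rng.standard_normal((4, 8))   # e.g., DSM tokens
fused = ada_mba_sketch(x, y)
print(fused.shape)                # (4, 8): one fused token per input token
```

Both attention maps are computed from the same queries, so the two context streams stay token-aligned before blending, which is what allows them to be combined position-wise.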

Keywords:
Computer science, Segmentation, Fusion, Artificial intelligence, Transformer, Computer vision, Remote sensing, Image segmentation, Pattern recognition (psychology), Geology, Engineering

Metrics

Cited By: 177
FWCI (Field-Weighted Citation Impact): 108.84
References: 65
Citation Normalized Percentile: 1.00 (in top 1%; in top 10%)

Topics

Remote-Sensing Image Classification (Physical Sciences → Engineering → Media Technology)
Advanced Image Fusion Techniques (Physical Sciences → Engineering → Media Technology)
Remote Sensing and Land Use (Physical Sciences → Earth and Planetary Sciences → Atmospheric Science)

Related Documents

JOURNAL ARTICLE

CNN and Transformer Fusion for Remote Sensing Image Semantic Segmentation

Xin Chen, Dongfen Li, Mingzhe Liu, Jiaru Jia

Journal: Remote Sensing  Year: 2023  Vol: 15 (18)  Pages: 4455
JOURNAL ARTICLE

Learning Frequency-Domain Fusion for Multimodal Remote Sensing Semantic Segmentation

Guangsheng Chen, Fangyu Sun, Weipeng Jing, Weitao Zou, Donglin Di, Yang Song, Lei Fan

Journal: IEEE Transactions on Geoscience and Remote Sensing  Year: 2025  Vol: 63  Pages: 1-16
JOURNAL ARTICLE

CTFNet: CNN-Transformer Fusion Network for Remote-Sensing Image Semantic Segmentation

Honglin Wu, Peng Huang, Min Zhang, Wenlong Tang

Journal: IEEE Geoscience and Remote Sensing Letters  Year: 2023  Vol: 21  Pages: 1-5