J. F. Qiu, Wei Chang, Wei Ren, Shanshan Hou, Ronghao Yang
Accurate semantic segmentation of high-resolution remote sensing imagery is challenged by substantial intra-class variability, inter-class similarity, and the limitations of single-modality data. This paper proposes MMFNet, a novel multimodal fusion network that leverages the Mamba architecture to efficiently capture long-range dependencies for semantic segmentation. MMFNet adopts a dual-encoder design, combining ResNet-18 for local detail extraction with VMamba for global contextual modelling, striking a balance between segmentation accuracy and computational efficiency. A Multimodal Feature Fusion Block (MFFB) is introduced to integrate complementary information from optical imagery and digital surface models (DSMs), thereby strengthening multimodal feature interaction and improving segmentation accuracy. Furthermore, a frequency-aware upsampling module (FreqFusion) is incorporated into the decoder to sharpen boundary delineation and recover fine spatial detail. Extensive experiments on the ISPRS Vaihingen and Potsdam benchmarks demonstrate that MMFNet achieves mean IoU scores of 83.50% and 86.06%, respectively, outperforming eight state-of-the-art methods while maintaining relatively low computational complexity. These results highlight MMFNet’s potential for efficient and accurate multimodal semantic segmentation in remote sensing applications.
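To make the dual-encoder idea concrete, below is a minimal PyTorch sketch of the data flow the abstract describes: an optical image passes through a ResNet-18 local branch, a DSM passes through a global-context branch, and the two feature maps are fused before decoding. This is an illustration under stated assumptions, not the authors' implementation: the MFFB internals are guessed as a simple gated fusion (the abstract does not specify them), a plain convolutional stem stands in for the VMamba state-space branch, and the decoder head omits FreqFusion in favor of bilinear upsampling.

```python
# Hedged sketch of MMFNet's dual-encoder design, reconstructed from the
# abstract alone. All module internals below are hypothetical placeholders.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class MFFB(nn.Module):
    """Hypothetical multimodal feature fusion block: a gated merge of
    optical and DSM features (the paper's actual MFFB is not specified here)."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_opt, f_dsm):
        x = torch.cat([f_opt, f_dsm], dim=1)
        return self.proj(x) * self.gate(x)


class DualEncoderSeg(nn.Module):
    def __init__(self, num_classes: int = 6):
        super().__init__()
        # Local-detail branch on the optical image: ResNet-18 stem + layer1.
        r = resnet18(weights=None)
        self.local = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool, r.layer1)
        # Stand-in for the VMamba global-context branch on the DSM; the real
        # branch uses 2D selective-scan state-space blocks, not convolutions.
        self.global_ctx = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=7, stride=4, padding=3),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
        )
        self.fuse = MFFB(64)
        # Simple decoder head; MMFNet uses FreqFusion upsampling instead.
        self.head = nn.Sequential(
            nn.Conv2d(64, num_classes, kernel_size=1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
        )

    def forward(self, optical, dsm):
        f_opt = self.local(optical)   # (B, 64, H/4, W/4)
        f_dsm = self.global_ctx(dsm)  # (B, 64, H/4, W/4)
        return self.head(self.fuse(f_opt, f_dsm))


if __name__ == "__main__":
    model = DualEncoderSeg(num_classes=6)
    logits = model(torch.randn(2, 3, 256, 256),   # optical RGB patch
                   torch.randn(2, 1, 256, 256))   # single-channel DSM patch
    print(logits.shape)  # torch.Size([2, 6, 256, 256])
```

The gated fusion keeps both branches at the same channel width so the merge adds little overhead, which is consistent with the abstract's emphasis on balancing accuracy against computational cost; the actual trade-offs depend on the paper's MFFB and FreqFusion designs.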