Ground objects in remote sensing images, such as buildings and roads, exhibit pronounced spatial geometric characteristics, while targets within a scene vary enormously in scale. Both properties challenge the ability of semantic segmentation algorithms to model spatial structure and to process information across scales. Traditional methods lack dedicated mechanisms for modeling spatial geometric features and lose information during multi-scale feature fusion. This paper proposes SC-Net, which addresses these issues through three key innovations. First, we design a feature attention layer in which a spatial attention module captures spatial geometric patterns through directional feature decomposition, and a multi-scale attention module preserves feature information at different scales through adaptive pooling strategies. Second, we construct a three-branch fusion transformer that employs cross-window attention and nine-group key-value pair interactions to jointly model spatial, multi-scale, and global features. Finally, a multi-branch cascaded decoder improves segmentation boundary accuracy through hierarchical feature fusion. Comprehensive experiments on three standard remote sensing datasets validate the method. SC-Net achieves a mean intersection over union (mIoU) of 63.04% on the Wuhan dense labeling dataset (WHDLD), 71.57% on the Potsdam dataset, and 81.57% on the Vaihingen dataset, outperforming state-of-the-art methods such as AerialFormer and SERNet by 0.67–2.12% mIoU. The method performs particularly well in scenes with complex spatial structures and dense multi-scale targets, providing an effective solution for precise remote sensing image interpretation.
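The results above are reported as mean intersection over union. As a reminder of how that metric is commonly computed, the following is a minimal sketch in plain Python. It is not the authors' evaluation code: the function names `confusion_matrix` and `mean_iou` are hypothetical, and the choice to skip classes that appear in neither the labels nor the predictions is an assumption that may differ from the protocol used for WHDLD, Potsdam, or Vaihingen.

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count pixels: rows are ground-truth classes, columns are predictions."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def mean_iou(cm: List[List[int]]) -> float:
    """Average the per-class IoU = TP / (TP + FP + FN).

    Assumption: classes with an empty union (no true or predicted pixels)
    are left out of the average.
    """
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, true class differs
        union = tp + fp + fn
        if union > 0:
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with flattened label maps for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(mean_iou(confusion_matrix(y_true, y_pred, num_classes=3)))
```

In the toy example the per-class IoU values are 1/3, 2/3, and 1/2, so the mean is 0.5. In practice the confusion matrix is usually accumulated over all test images before the per-class IoU is averaged.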
Fei Wang, Yanhong Yang, Haozheng Zhang, Hongtao Wang