A critical requirement for the success of supervised deep learning is a large number of annotated images, which is often difficult to obtain in remote sensing semantic segmentation tasks. Self-supervised contrastive learning (CL) offers a strategy for learning general feature representations by pre-training neural networks on vast amounts of unlabeled data and subsequently fine-tuning them on downstream tasks with limited annotations. However, the vast majority of CL methods are built on instance-discrimination pretext tasks, focusing solely on learning a global representation of the entire image while disregarding the spatial and semantic correlations that are crucial for semantic segmentation. To address these issues, in this paper we propose a spatial and semantic consistency contrastive learning (SSCCL) framework for the semantic segmentation of remote sensing images. Specifically, a consistency branch in SSCCL is designed to learn feature representations with spatial and semantic consistency by maximizing the similarity of the overlapping regions of two augmented views. Additionally, an instance branch is introduced to learn global representations by enforcing the similarity of two augmented views of one image. By integrating the consistency branch and the instance branch, the proposed SSCCL framework can learn robust and informative feature representations for semantic segmentation in remote sensing scenarios. The proposed method was evaluated on three publicly available remote sensing semantic segmentation datasets, and the experimental results show that our method achieves superior segmentation performance with limited annotations compared to state-of-the-art CL methods as well as ImageNet pre-training.
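The two branches described in the abstract can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification, not the paper's implementation: the function names (`consistency_loss`, `instance_loss`, `sscc_loss`), the `(1 - cosine similarity)` form of each term, and the weighting factor `lam` are all assumptions made for illustration. The consistency term compares spatially aligned dense features inside the overlap of the two augmented views; the instance term compares globally pooled embeddings of the full views.

```python
import numpy as np

EPS = 1e-8  # numerical guard for the cosine denominator


def _cosine(a, b):
    """Row-wise cosine similarity between two (N, C) arrays."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + EPS
    return num / den


def consistency_loss(feat1, feat2, box1, box2):
    """Consistency branch (hypothetical form): 1 - mean cosine similarity
    between spatially aligned features in the overlap of two views.

    feat1, feat2: (H, W, C) dense feature maps of the two augmented views.
    box1, box2:   (y0, x0, y1, x1) coordinates of the shared overlap,
                  expressed in each view's own coordinate frame.
    """
    y0, x0, y1, x1 = box1
    r1 = feat1[y0:y1, x0:x1].reshape(-1, feat1.shape[-1])
    y0, x0, y1, x1 = box2
    r2 = feat2[y0:y1, x0:x1].reshape(-1, feat2.shape[-1])
    assert r1.shape == r2.shape, "overlap regions must align pixel-wise"
    return 1.0 - float(np.mean(_cosine(r1, r2)))


def instance_loss(feat1, feat2):
    """Instance branch (hypothetical form): 1 - cosine similarity of the
    globally average-pooled embeddings of the two views."""
    g1 = feat1.reshape(-1, feat1.shape[-1]).mean(axis=0)
    g2 = feat2.reshape(-1, feat2.shape[-1]).mean(axis=0)
    return 1.0 - float(_cosine(g1[None], g2[None])[0])


def sscc_loss(feat1, feat2, box1, box2, lam=1.0):
    """Assumed combined objective: consistency term plus a weighted
    instance term (the actual weighting in SSCCL may differ)."""
    return consistency_loss(feat1, feat2, box1, box2) + lam * instance_loss(
        feat1, feat2
    )
```

As a sanity check, two crops taken from the same feature grid have identical features in their overlap, so the consistency term vanishes there; in real training the two views are independently augmented and encoded, so both terms are nonzero and drive the encoder toward spatially and globally consistent representations.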
Haifeng Li, Yi Li, Guo Zhang, Ruoyun Liu, Haozhe Huang, Qing Zhu, Chao Tao