Shuting Liu, Teng Xu, Wenbo Zhang, Xiaomin Wang
Abstract Remote sensing image scene classification (RSISC) is a key focus of remote sensing image interpretation. Both CNN-based and ViT-based semantic feature extraction methods have been applied to RSISC. However, complex scene images exhibit high intraclass diversity and interclass similarity, which remain significant obstacles to the feature extraction capabilities of network models. To address these challenges, this study introduces SGMSNet, a two-branch model. The first branch optimizes the key-value information in the self-attention mechanism, enabling ViT to extract the global structural features of the target scene image effectively while keeping the network parameter cost low. The second branch constructs a lightweight pyramid network to extract multiple irregular local key features of the target scene image, compensating for the features lost by the first branch. A dedicated feature fusion module then automatically adjusts the weights of the global and local features for each scene image and merges them. The overall accuracies on the UC Merced Land Use Dataset (UCM), the Aerial Image Dataset (AID), and the Northwestern Polytechnical University (NWPU)-RESISC45 Dataset were 99.17%, 97.43%, and 94.87%, respectively. These results show that SGMSNet is suitable for applications with low network complexity requirements.
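The fusion step described in the abstract, in which global and local features are weighted per image and then merged, can be illustrated with a small gating sketch. This is not the SGMSNet fusion module, whose design the abstract does not specify; it assumes a softmax gate over two scores computed from the concatenated features, and the names `fuse_features`, `gate_weights`, and `gate_bias` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(global_feat, local_feat, gate_weights, gate_bias):
    """Merge one image's global and local feature vectors.

    Assumption (not from the paper): a linear gate maps the concatenated
    features to two scores, and a softmax turns them into mixing weights.
    gate_weights has 2 rows, each of length len(global_feat) + len(local_feat).
    """
    joint = global_feat + local_feat  # list concatenation
    scores = [
        sum(w * x for w, x in zip(row, joint)) + b
        for row, b in zip(gate_weights, gate_bias)
    ]
    w_global, w_local = softmax(scores)
    fused = [w_global * g + w_local * l for g, l in zip(global_feat, local_feat)]
    return fused, (w_global, w_local)


# Toy example: an all-zero gate gives equal weights, so the result is the mean.
global_feat = [1.0, 2.0]
local_feat = [3.0, 4.0]
zero_gate = [[0.0] * 4, [0.0] * 4]
fused, weights = fuse_features(global_feat, local_feat, zero_gate, [0.0, 0.0])
```

In a trained network the gate parameters would be learned, so the two weights would differ from image to image; the toy gate here only demonstrates the mechanics of the weighted merge.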
Zhou Yang, Xiaodong Mu, Feng’an Zhao
Yingjie Jin, Xiaoyang Li, Jianjun Liu
Ruiyao Liu, Xiaoyong Bian, Yuxia Sheng
Jiadong Lin, Lingling Li, Huaji Zhou, Licheng Jiao, Fang Liu, Xu Liu