Remote sensing image scene classification (RSSC), which assigns semantic labels to remote sensing images, is of great importance for remote sensing image interpretation. Thanks to the rapid development of deep learning, RSSC has achieved significant breakthroughs through the use of convolutional neural networks (CNNs). However, CNNs rely on local receptive fields and struggle to capture long-range and global scene information. Moreover, the information carried by salient objects, which helps discriminate scene categories (e.g., airplanes indicate the airport scene), should also be exploited. To address these issues, a deep learning method named multi-level representation learning (MLRL) is proposed to collaboratively extract pixel-level, patch-level, and object-level features, which respectively convey local, global, and object-oriented information. Specifically, pixel-level features are obtained by pixel-wise convolution operations within a CNN. Patch-level features are produced by a patch-wise self-attention network. Object-level features are acquired by applying a CNN to a cropped sub-image that conveys important information about salient objects. To this end, a three-branch network structure is built to extract these features respectively. Finally, a decision fusion method is adopted to integrate the multi-level features and yield refined classification results. Experiments conducted on widely used datasets demonstrate the effectiveness of the proposed method.
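The following is a minimal sketch of the three-branch structure described above, assuming a PyTorch setup. The backbone choice (ResNet-18), patch size, the way the salient-object crop is obtained, and probability averaging as the decision-fusion rule are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class MLRLSketch(nn.Module):
    """Illustrative three-branch network: pixel-level CNN, patch-level
    self-attention, and object-level CNN on a cropped sub-image."""

    def __init__(self, num_classes, patch_size=16, embed_dim=256):
        super().__init__()
        # Pixel-level branch: a standard CNN applied to the full image.
        self.pixel_branch = models.resnet18(weights=None)
        self.pixel_branch.fc = nn.Linear(self.pixel_branch.fc.in_features, num_classes)

        # Patch-level branch: embed non-overlapping patches, then apply
        # self-attention across the patch tokens to model global context.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.patch_attention = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.patch_head = nn.Linear(embed_dim, num_classes)

        # Object-level branch: a CNN applied to a cropped sub-image that is
        # assumed to cover the salient object (e.g., an airplane).
        self.object_branch = models.resnet18(weights=None)
        self.object_branch.fc = nn.Linear(self.object_branch.fc.in_features, num_classes)

    def forward(self, image, object_crop):
        # image: (B, 3, H, W); object_crop: (B, 3, h, w) cropped around a salient object.
        pixel_logits = self.pixel_branch(image)

        patches = self.patch_embed(image)             # (B, C, H/p, W/p)
        patches = patches.flatten(2).transpose(1, 2)  # (B, N, C) patch tokens
        patch_feat = self.patch_attention(patches).mean(dim=1)
        patch_logits = self.patch_head(patch_feat)

        object_logits = self.object_branch(object_crop)

        # Decision fusion (assumed here as simple averaging of per-branch
        # class probabilities).
        probs = (pixel_logits.softmax(-1)
                 + patch_logits.softmax(-1)
                 + object_logits.softmax(-1)) / 3
        return probs
```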