In this paper, we present a scene classification method based on vision transformers. These networks, which are now the standard models in natural language processing (NLP), do not rely on convolution blocks as convolutional neural networks (CNNs) do. Instead, they are based on a mechanism known as multi-head self-attention (MSA), which captures contextual relations between image pixels regardless of their spatial distance. In the first step, the images under analysis are split into patches, which are then flattened and embedded to form a sequence of tokens. Position embeddings are added to this sequence to preserve the order of the patches. The resulting sequence is then fed to several MSA layers to generate the final representation. To increase classification performance, we employ several data augmentation strategies to expand the size and diversity of the training data. Additionally, we show experimentally that the network can be compressed by pruning half of its layers while maintaining competitive performance. We further investigate the performance of the data-efficient image transformer (DeiT), a variant of the model trained by knowledge distillation that requires less training data. Experimental results on two remote sensing datasets show that vision transformers can outperform state-of-the-art methods based on CNNs.
Yakoub Bazi, Laila Bashmal, Mohamad Mahmoud Al Rahhal, Reham Al-Dayil, Naif Al Ajlan
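The following is a minimal PyTorch sketch, not the authors' code, of the pipeline the abstract describes: the image is split into patches, the patches are flattened and linearly embedded, learned position embeddings are added to preserve patch order, and the token sequence is processed by a stack of multi-head self-attention (Transformer encoder) layers. The class name TinyViT and all sizes (patch size 16, embedding dimension 192, 6 layers, 3 heads, 10 classes) are illustrative assumptions, not values taken from the paper.

# Minimal sketch of the vision-transformer pipeline described in the abstract.
# Names and dimensions are illustrative assumptions, not the authors' settings.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=192,
                 depth=6, heads=3, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch splitting + flattening + linear embedding via a strided convolution.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # Class token and learned position embeddings preserve patch order.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        # Stack of multi-head self-attention (Transformer encoder) layers.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                            # x: (B, 3, H, W)
        tokens = self.patch_embed(x)                 # (B, dim, H/ps, W/ps)
        tokens = tokens.flatten(2).transpose(1, 2)   # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])               # classify from the class token

model = TinyViT()
logits = model(torch.randn(2, 3, 224, 224))          # logits has shape (2, 10)

Under this sketch, the layer-pruning experiment mentioned in the abstract would roughly correspond to rebuilding the encoder with num_layers=depth // 2 and fine-tuning the smaller model; the exact pruning procedure used in the paper is described in its methodology section.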