Following the recent success of transformer-based models in natural language processing and computer vision, researchers have begun to investigate their potential in remote sensing (RS) applications such as scene classification. In this review article, we provide an overview of recent developments in vision transformer (ViT)-based models for remote sensing image scene classification (RSISC). We first introduce the basic architecture of transformer models and their extensions to computer vision tasks. We then summarize the current state-of-the-art ViT-based models for RSISC, including their architectures, training strategies, and performance evaluation. We also discuss the challenges and limitations of existing ViT-based models. Finally, we outline potential future directions for developing transformer-based models for RS applications. This review aims to provide a comprehensive analysis of the current state of the art and of future research prospects for ViTs in RSISC, and to serve as a reference for researchers and practitioners in this field.
Pankaj Kumar Gharai, Mogalla Shashi
Hao Yuan, Kun Liu, Jiechuan Shi, Can Wang, Weiwei Wang
Laila Bashmal, Yakoub Bazi, Mohamad Mahmoud Al Rahhal
Yakoub Bazi, Laila Bashmal, Mohamad Mahmoud Al Rahhal, Reham Al-Dayil, Naif Al Ajlan
Senlin Li, Jinhong Guo, Juan Li, Wenbin He, Kun Zhao