Wen Xie, Yuzhuo Zhang, Hongyue Sun, Qinzhe Wu
Although Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) perform well in hyperspectral image (HSI) classification, both have inherent limitations: CNNs cannot fully mine the spectral feature information in HSIs, and ViTs cannot effectively extract local spatial features. To address these problems, we propose the 3D-CNN Multi-Head Self-Attention Fusion Transformer (3DMFT), a network that combines a Transformer with a CNN. 3DMFT fuses 3D-CNN with the query-key-value (QKV) computation of self-attention to learn both deep and shallow features in HSIs. Moreover, a local spatial-spectral position encoding captures spatial-spectral position information between elements, and a pyramid structure is introduced to transfer image features from shallow layers to deep layers. Experimental results show that 3DMFT captures both global context dependencies and fine local features well. Compared with several state-of-the-art methods, the proposed 3DMFT network is more efficient.
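The abstract does not spell out how 3D-CNN and QKV are fused. One plausible reading is that the queries, keys, and values of multi-head self-attention are produced by 3D convolutions over the spatial-spectral cube rather than by plain linear layers. The NumPy sketch below illustrates only that reading. The function names (`conv3d_same`, `conv_qkv_attention`), tensor shapes, kernel size, and head count are assumptions made for illustration and are not taken from the paper; position encoding and the pyramid structure are omitted.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv3d_same(x, w):
    """Naive zero-padded 3D cross-correlation that preserves the input size.

    x: (C_in, D, H, W) patch cube, D being the spectral axis.
    w: (C_out, C_in, k, k, k) kernel with odd k.
    """
    k = w.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p), (p, p)))
    # win has shape (C_in, D, H, W, k, k, k): one k*k*k window per voxel.
    win = sliding_window_view(xp, (k, k, k), axis=(1, 2, 3))
    return np.einsum("cdhwijk,ocijk->odhw", win, w)


def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def conv_qkv_attention(x, wq, wk, wv, num_heads):
    """Multi-head self-attention whose Q, K, V come from 3D convolutions
    (hypothetical fusion scheme, assumed for this sketch)."""
    q, k, v = (conv3d_same(x, w) for w in (wq, wk, wv))
    c, d, h, w_ = q.shape
    n = d * h * w_            # every voxel is one token
    hd = c // num_heads       # channels per head

    def split(t):
        # (C, D, H, W) -> (heads, N, hd)
        return t.reshape(num_heads, hd, n).transpose(0, 2, 1)

    q, k, v = split(q), split(k), split(v)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(hd), axis=-1)
    out = attn @ v            # (heads, N, hd)
    return out.transpose(0, 2, 1).reshape(c, d, h, w_), attn


# Toy input: 1 channel, 8 spectral bands, 5x5 spatial neighbourhood.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8, 5, 5))
wq, wk, wv = (0.1 * rng.standard_normal((4, 1, 3, 3, 3)) for _ in range(3))
out, attn = conv_qkv_attention(x, wq, wk, wv, num_heads=2)
print(out.shape, attn.shape)  # (4, 8, 5, 5) (2, 200, 200)
```

In this sketch the convolutions give each token a local spatial-spectral receptive field before attention mixes information globally across all 200 voxels, which is the local-plus-global combination the abstract describes.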
Sunita Arya, Shiv Ram Dubey, S. Manthira Moorthi, Debajyoti Dhar, Satish Kumar Singh
Weijia Zeng, Wei Li, Mengmeng Zhang, Hao Wang, Meng Lv, Yue Yang, Ran Tao
Junbo Zhou, Shan Zeng, Zuyin Xiao, Jinbo Zhou, Hao Li, Zhen Kang
Alou Diakite, Jiangsheng Gui, Xiaping Fu
Peng Chen, Wenxuan He, Qian Feng, Guangyao Shi, Jingwen Yan