Huiqing Wang, Zhongyu Li, Linfeng Wu
ABSTRACT In earth observation, multimodal remote sensing (RS) image fusion has attracted considerable research interest. Although deep learning networks have made great progress in multimodal RS image classification, challenges remain in multimodal feature fusion strategies, in modeling the sequential order of spectral features, and in capturing the location of spatial features. This paper therefore presents a novel approach for classifying multimodal RS data based on cross‐transformer fusion (CTF). First, independent component analysis (ICA) is used to reduce the dimensionality of the spectral features, and dual‐branch 3D and 2D convolutional neural networks (CNNs) extract the spectral‐spatial and height‐related features across the modalities. Then, to fuse the features extracted from the two modalities, a cross‐transformer feature fusion strategy is designed that exploits the transformer's ability to model long‐range dependencies and its suitability for processing spectral feature sequences. Combining the strength of CNNs in extracting spatial context information with the CTF‐based transformer architecture improves the recognition, extraction, and fusion of multimodal feature information. To validate the proposed approach, it is evaluated on three benchmark multimodal RS datasets. The experimental results demonstrate that the method outperforms existing state‐of‐the‐art techniques in terms of classification accuracy.
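The abstract does not give the internals of the CTF module, so the following is only a minimal sketch of generic bidirectional cross-attention, the mechanism that cross-transformer fusion schemes commonly build on. It assumes the two CNN branches have already produced token sequences of equal width. The names `spectral_tokens`, `height_tokens`, `cross_attention`, and `cross_fuse`, the residual connection, and the mean-pool-and-concatenate readout are all illustrative assumptions, not the authors' design.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    # Queries come from one modality, keys and values from the other,
    # so each query token is rebuilt as a mixture of the other modality's tokens.
    q = query_tokens @ w_q
    k = context_tokens @ w_k
    v = context_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


def cross_fuse(spectral_tokens, height_tokens, params):
    # Hypothetical fusion readout: attend in both directions, add a residual,
    # mean-pool each stream over its tokens, then concatenate the two vectors.
    s_att, _ = cross_attention(spectral_tokens, height_tokens, *params["spectral"])
    h_att, _ = cross_attention(height_tokens, spectral_tokens, *params["height"])
    s_out = (spectral_tokens + s_att).mean(axis=0)
    h_out = (height_tokens + h_att).mean(axis=0)
    return np.concatenate([s_out, h_out])


# Toy inputs standing in for CNN branch outputs (assumed shapes, random values).
rng = np.random.default_rng(0)
d = 8
spectral_tokens = rng.normal(size=(5, d))  # 5 tokens from the spectral-spatial branch
height_tokens = rng.normal(size=(3, d))    # 3 tokens from the height branch
params = {
    name: tuple(rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    for name in ("spectral", "height")
}
fused = cross_fuse(spectral_tokens, height_tokens, params)
```

Here `fused` is a single vector of length `2 * d` that a classification head could consume. In a trained model the projection matrices would be learned rather than sampled at random.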
Swalpa Kumar Roy, Ankur Deria, Danfeng Hong, Behnood Rasti, Antonio Plaza, Jocelyn Chanussot
Mengru Ma, Wenping Ma, Licheng Jiao, Xu Liu, Lingling Li, Zhixi Feng, Fang Liu, Shuyuan Yang
Xiaoli Gao, Ming Zhang, Dahua Yu, Jianjun Li, Kehong Liu, Guoqing Li