This paper presents a comparative analysis of transformer-based fusion methods applied to a novel multimodal dataset for remote sensing semantic segmentation. We evaluate how the choice of fusion method affects segmentation accuracy. For early fusion, we investigate Early Concatenation. For middle fusion, we investigate four methods: Token Patch Embedding, Channel Patch Embedding, Token Fusion at Attention Level, and Cross-Attention. Finally, as a representative of late fusion, we investigate Late Concatenation. All of the methods presented here are designed to operate with every modality under investigation. Experiments conducted on the Ticino dataset show that Late Concatenation outperforms the best single-modality method (RGB) by 4.04%, 2.24%, and 3.47% in accuracy, precision, and mIoU, respectively. This study provides an opportunity to further explore transformer-based fusion methods and to better understand the potential of data fusion.
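To make the difference between the two concatenation strategies concrete, the following is a minimal sketch in plain Python and not the implementation evaluated in the paper. It assumes co-registered modalities stored as flat lists of per-pixel channel vectors. The names `concat_channels`, `toy_encoder`, `rgb`, and `aux` are hypothetical, and `toy_encoder` (a per-pixel channel mean) merely stands in for a transformer encoder.

```python
from typing import Callable, List, Sequence

Pixel = List[float]   # channel (or feature) values for one pixel
Image = List[Pixel]   # flattened H*W list of pixels
Encoder = Callable[[Image], Image]


def concat_channels(images: Sequence[Image]) -> Image:
    """Concatenate the per-pixel vectors of co-registered images."""
    n_pixels = len(images[0])
    if any(len(img) != n_pixels for img in images):
        raise ValueError("all modalities must cover the same pixels")
    return [sum((img[i] for img in images), []) for i in range(n_pixels)]


def toy_encoder(image: Image) -> Image:
    """Hypothetical stand-in for an encoder: one feature per pixel (channel mean)."""
    return [[sum(px) / len(px)] for px in image]


def early_concatenation(modalities: Sequence[Image], encoder: Encoder) -> Image:
    """Fuse the raw inputs first, then run a single shared encoder."""
    return encoder(concat_channels(modalities))


def late_concatenation(modalities: Sequence[Image],
                       encoders: Sequence[Encoder]) -> Image:
    """Encode each modality independently, then fuse the resulting features."""
    features = [enc(m) for enc, m in zip(encoders, modalities)]
    return concat_channels(features)


# Two pixels, two hypothetical modalities: 3-channel RGB and a 1-channel auxiliary band.
rgb: Image = [[0.2, 0.4, 0.6], [1.0, 1.0, 1.0]]
aux: Image = [[10.0], [20.0]]

early = early_concatenation([rgb, aux], toy_encoder)
late = late_concatenation([rgb, aux], [toy_encoder, toy_encoder])

print(len(early[0]))  # features per pixel after early fusion
print(len(late[0]))   # features per pixel after late fusion
```

In this sketch, early fusion yields one feature vector produced by a single encoder that sees all channels at once, whereas late fusion keeps one encoder per modality and only joins their outputs, so the fused feature dimension grows with the number of modalities.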