Although semantic segmentation networks based on CNNs or RNNs already perform the task well, multimodal input and the Transformer architecture leave room for further improvement. In this paper, we apply the Transformer to the multimodal input setting. However, the Transformer does not handle multimodal inputs well on its own, and deciding how and where features from different modalities should interact poses a major challenge for the design of the model's fusion scheme. To address this, we improve the Vision Transformer with the Token Fusion model and efficiently perform image semantic segmentation on RGB-Depth multimodal input.
Xianfan Gu, Yingdong Hu, Chuan Wen, Yang Gao
Hans Thisanke, Chamli Deshan, Kavindu Chamith, Sachith Seneviratne, Rajith Vidanaarachchi, Damayanthi Herath
Xinting Hu, Li Jiang, Bernt Schiele
Li Zhang, Jiachen Lu, Sixiao Zheng, Xinxuan Zhao, Xiatian Zhu, Yanwei Fu, Tao Xiang, Jianfeng Feng, Philip H. S. Torr