JOURNAL ARTICLE

Multimodal Semantic Segmentation Based On Improved Vision Transformers

Abstract

Although semantic segmentation networks based on CNNs or RNNs already perform the semantic segmentation task well, introducing multimodal input and the Transformer architecture leaves further room for performance improvement. In this paper, we apply the Transformer to the multimodal input scenario. However, a vanilla Transformer does not handle multimodal inputs well, and deciding how and where features from different modalities should interact poses a great challenge to the design of the model's fusion scheme. To address this, this paper improves the Vision Transformer with the TokenFusion scheme and efficiently completes the image semantic segmentation task for RGB-Depth multimodal input.
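The core TokenFusion idea the abstract refers to is to detect uninformative tokens in one modality and substitute them with the spatially aligned tokens of the other modality. A minimal sketch of that substitution step, assuming per-token importance scores (in the original method these come from a learned scoring network) and a fixed pruning threshold chosen here for illustration:

```python
import numpy as np

def token_fusion(tokens_a, tokens_b, scores_a, threshold=0.02):
    """Substitute uninformative tokens of modality A with the
    spatially aligned tokens of modality B (TokenFusion-style).

    tokens_a, tokens_b: (num_tokens, dim) token embeddings of the
    two modalities (e.g. RGB and depth), spatially aligned.
    scores_a: per-token importance scores for modality A.
    """
    mask = scores_a < threshold                       # uninformative tokens of A
    fused = np.where(mask[:, None], tokens_b, tokens_a)  # replace them with B's tokens
    return fused, mask

# Toy example: 4 tokens of dimension 3 per modality.
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 3))      # RGB tokens
b = rng.normal(size=(4, 3))      # depth tokens
scores = np.array([0.5, 0.01, 0.3, 0.005])  # hypothetical learned scores
fused, mask = token_fusion(a, b, scores)
print(mask)  # tokens 1 and 3 fall below the threshold and are substituted
```

In the full model this exchange happens inside the Transformer layers, so attention in each branch can then mix the substituted cross-modal tokens with the surviving unimodal ones.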

Keywords:
Computer science; Segmentation; Transformer; Artificial intelligence; Computer vision; Image segmentation; Engineering
