Yujin Cho, A. R. Lee, Byung‐Gyu Kim, Jan Plato
Abstract
The Vision Transformer (ViT) has demonstrated remarkable performance in a wide range of computer vision tasks, such as image classification, object detection, and image generation. Unlike convolutional neural networks (CNNs), ViT benefits from a global receptive field, which enables more effective modeling of relationships between image patches. However, the lack of the inductive biases found in CNNs makes ViT models difficult to train stably, especially on limited datasets; without access to large-scale pre-trained weights, performance often degrades significantly. To address this issue, we propose a novel architecture called RMSF-ViT. It employs a progressive fusion strategy that incorporates fine-grained patch information beyond the fixed single patch size used in conventional ViT architectures. In addition, RMSF-ViT halves the number of attention heads relative to vanilla ViT models. This design improves both performance and computational efficiency, as demonstrated on the CIFAR-10, CIFAR-100, Flowers, and Pets datasets.
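To make the two ideas named in the abstract concrete, below is a minimal PyTorch sketch of (a) embedding an image at a coarse and a fine patch size and fusing the two token grids, and (b) running the fused tokens through an encoder with half the attention heads of ViT-Base. Everything specific here is an assumption for illustration, not the paper's method: the patch sizes (16 and 8), the pool-and-add fusion step, the embedding dimension (768), the head count (6 versus ViT-Base's 12), and the module names `MultiScalePatchEmbed` and `SketchViT` are all hypothetical; RMSF-ViT's actual progressive fusion may differ substantially.

```python
# Hedged sketch of a multi-scale patch-fusion ViT with halved heads.
# All design choices below are illustrative assumptions, not RMSF-ViT itself.
import torch
import torch.nn as nn

class MultiScalePatchEmbed(nn.Module):
    """Embed the image at a coarse and a fine patch size, then fuse."""
    def __init__(self, img_size=224, coarse=16, fine=8, dim=768):
        super().__init__()
        self.coarse_proj = nn.Conv2d(3, dim, kernel_size=coarse, stride=coarse)
        self.fine_proj = nn.Conv2d(3, dim, kernel_size=fine, stride=fine)
        # Pool fine tokens down to the coarse grid so the two scales can be
        # added element-wise -- one simple way to fuse fine-grained detail.
        self.pool = nn.AvgPool2d(kernel_size=coarse // fine)

    def forward(self, x):
        coarse = self.coarse_proj(x)               # (B, dim, 14, 14)
        fine = self.pool(self.fine_proj(x))        # (B, dim, 14, 14)
        fused = coarse + fine
        return fused.flatten(2).transpose(1, 2)    # (B, 196, dim)

class SketchViT(nn.Module):
    def __init__(self, dim=768, depth=12, heads=6, num_classes=10):
        # heads=6 is half of ViT-Base's 12, mirroring the abstract's claim.
        super().__init__()
        self.embed = MultiScalePatchEmbed(dim=dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 197, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        tokens = self.embed(x)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])             # classify from [CLS]

model = SketchViT()
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 10])
```

Fusing by pooling the fine grid onto the coarse one keeps the token count (and thus attention cost) identical to a standard ViT, which is one plausible way the design could improve efficiency while still injecting fine-grained patch information.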