JOURNAL ARTICLE

Local masking meets progressive freezing: crafting efficient vision transformers for self-supervised learning

Abstract

This paper presents an approach to self-supervised learning for Vision Transformers (ViTs) that integrates local masked image modeling with progressive layer freezing. The method makes training of the initial ViT layers faster and more efficient. By systematically freezing specific layers at strategic points during training, we reduce computational demands while maintaining learning capabilities. Our approach employs a novel multi-scale reconstruction process that fosters efficient learning in the initial layers and enhances semantic comprehension across scales. The results demonstrate a 12.5% reduction in training time with minimal impact on model accuracy (a 0.6% decrease in top-1 accuracy). Our method achieves top-1 and top-5 accuracies of 82.6% and 96.2%, respectively, underscoring its potential in scenarios where computational resources and time are critical. The implementation is available at the project's GitHub repository: https://github.com/utkutpcgl/ViTFreeze.
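To make the idea of progressive layer freezing concrete, the following is a minimal sketch of a freezing schedule in plain Python. It is not taken from the paper or its repository: the function names, the linear ramp, and the `max_frozen_fraction` parameter are all assumptions chosen for illustration, and the authors' actual schedule may differ.

```python
def frozen_layer_count(epoch, total_epochs, num_layers, max_frozen_fraction=0.5):
    """Number of earliest layers frozen at `epoch`.

    Hypothetical linear schedule (an assumption, not the paper's rule):
    no layer is frozen at epoch 0, and at most
    int(num_layers * max_frozen_fraction) layers are frozen by the last epoch.
    """
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch must be in [0, total_epochs)")
    max_frozen = int(num_layers * max_frozen_fraction)
    # Integer ramp from 0 up to max_frozen over the course of training.
    return (epoch * (max_frozen + 1)) // total_epochs


def trainable_mask(epoch, total_epochs, num_layers, max_frozen_fraction=0.5):
    """Per-layer flags: True if the layer is still trained at `epoch`."""
    frozen = frozen_layer_count(epoch, total_epochs, num_layers, max_frozen_fraction)
    # Freezing proceeds from the input side, so the first `frozen` layers are off.
    return [index >= frozen for index in range(num_layers)]


if __name__ == "__main__":
    # Example: a 12-layer encoder trained for 100 epochs.
    for epoch in (0, 25, 50, 75, 99):
        print(epoch, frozen_layer_count(epoch, 100, 12))
```

In a PyTorch training loop, a mask like this could be applied by setting `requires_grad = False` on the parameters of each frozen block, so that no gradients are computed or stored for those layers; the savings reported in the abstract come from the authors' own schedule, not from this sketch.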

Keywords:
Computer science; Masking (illustration); Transformer; Artificial intelligence; Computer vision; Engineering; Electrical engineering; Visual arts; Art

Metrics

Cited By: 1
FWCI (Field Weighted Citation Impact): 3.72
Refs: 22
Citation Normalized Percentile: 0.79

Topics

Advanced Sensor and Control Systems (Physical Sciences → Engineering → Control and Systems Engineering)
Building Energy and Comfort Optimization (Physical Sciences → Engineering → Building and Construction)
Infrared Target Detection Methodologies (Physical Sciences → Engineering → Aerospace Engineering)