Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, Seong Joon Oh
Vision Transformer (ViT) extends the applicability of transformers from language processing to computer vision tasks, serving as an alternative architecture to existing convolutional neural networks (CNNs). Because transformer-based architectures are a recent innovation in computer vision modeling, design conventions for building them effectively remain under-studied. Drawing on the successful design principles of CNNs, we investigate the role of spatial dimension conversion and its effectiveness in transformer-based architectures. We focus in particular on the dimension reduction principle of CNNs: as depth increases, a conventional CNN increases the channel dimension and decreases the spatial dimensions. We empirically show that such spatial dimension reduction benefits transformer architectures as well, and propose a novel Pooling-based Vision Transformer (PiT) built upon the original ViT model. We show that PiT achieves improved model capability and generalization performance compared to ViT. Through extensive experiments, we further show that PiT outperforms the baseline on several tasks, including image classification, object detection, and robustness evaluation. Source code and ImageNet models are available at this https URL
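A minimal sketch of the spatial-reduction idea described in the abstract: between transformer stages, reshape the token sequence back into a 2D grid, halve the spatial resolution while increasing the channel dimension, then flatten back to tokens. The abstract does not specify the pooling implementation; the strided depthwise convolution for spatial tokens, the separate linear projection for the class token, and the stride/channel choices below are assumptions for illustration, not the authors' confirmed design.

```python
import torch
import torch.nn as nn


class TokenPooling(nn.Module):
    """Halve spatial token resolution and expand channels (ViT-style tokens in/out).

    Hypothetical pooling layer illustrating the CNN-like dimension-reduction
    principle the abstract applies to transformers.
    """

    def __init__(self, in_dim: int, out_dim: int, stride: int = 2):
        super().__init__()
        # Strided depthwise convolution pools the spatial tokens
        # (an assumed choice; any spatial downsampling operator would do).
        self.conv = nn.Conv2d(
            in_dim, out_dim, kernel_size=stride + 1, stride=stride,
            padding=stride // 2, groups=in_dim,
        )
        # The class token has no spatial position, so project it separately.
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1 + H*W, in_dim); the first token is the class token.
        cls_token, spatial = x[:, :1], x[:, 1:]
        b, n, c = spatial.shape
        h = w = int(n ** 0.5)
        # Tokens -> 2D feature map, downsample, then back to tokens.
        spatial = spatial.transpose(1, 2).reshape(b, c, h, w)
        spatial = self.conv(spatial)                  # (b, out_dim, h/2, w/2)
        spatial = spatial.flatten(2).transpose(1, 2)  # (b, (h/2)*(w/2), out_dim)
        cls_token = self.fc(cls_token)
        return torch.cat([cls_token, spatial], dim=1)


# Usage: tokens from a 14x14 patch grid plus a class token, dim 192 -> 384.
tokens = torch.randn(2, 1 + 14 * 14, 192)
pooled = TokenPooling(192, 384)(tokens)
print(pooled.shape)  # torch.Size([2, 50, 384]) -> 7x7 grid + class token
```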