Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, Seong Joon Oh
Vision Transformer (ViT) extends the applicability of transformers from language processing to computer vision tasks, serving as an alternative architecture to existing convolutional neural networks (CNNs). Because transformer-based architectures are a recent innovation in computer vision modeling, design conventions for building them effectively remain under-studied. Drawing on the successful design principles of CNNs, we investigate the role of spatial dimension conversion and its effectiveness in transformer-based architectures. We focus in particular on the dimension reduction principle of CNNs: as depth increases, a conventional CNN increases the channel dimension and decreases the spatial dimensions. We empirically show that such spatial dimension reduction benefits transformer architectures as well, and propose a novel Pooling-based Vision Transformer (PiT) built upon the original ViT model. We show that PiT achieves improved model capability and generalization performance compared to ViT. Through extensive experiments, we further show that PiT outperforms the baseline on several tasks, including image classification, object detection, and robustness evaluation. Source code and ImageNet models are available at this https URL
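A minimal sketch of the spatial-reduction idea described in the abstract: between transformer stages, reshape the token sequence back into a 2D grid, halve the spatial resolution while increasing the channel dimension, then flatten back to tokens. The abstract does not specify the pooling implementation; the strided depthwise convolution for spatial tokens, the separate linear projection for the class token, and the stride/channel choices below are assumptions for illustration, not the authors' confirmed design.

```python
import torch
import torch.nn as nn


class TokenPooling(nn.Module):
    """Halve spatial token resolution and expand channels (ViT-style tokens in/out).

    Hypothetical pooling layer illustrating the CNN-like dimension-reduction
    principle the abstract applies to transformers.
    """

    def __init__(self, in_dim: int, out_dim: int, stride: int = 2):
        super().__init__()
        # Strided depthwise convolution pools the spatial tokens
        # (an assumed choice; any spatial downsampling operator would do).
        self.conv = nn.Conv2d(
            in_dim, out_dim, kernel_size=stride + 1, stride=stride,
            padding=stride // 2, groups=in_dim,
        )
        # The class token has no spatial position, so project it separately.
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1 + H*W, in_dim); the first token is the class token.
        cls_token, spatial = x[:, :1], x[:, 1:]
        b, n, c = spatial.shape
        h = w = int(n ** 0.5)
        # Tokens -> 2D feature map, downsample, then back to tokens.
        spatial = spatial.transpose(1, 2).reshape(b, c, h, w)
        spatial = self.conv(spatial)                  # (b, out_dim, h/2, w/2)
        spatial = spatial.flatten(2).transpose(1, 2)  # (b, (h/2)*(w/2), out_dim)
        cls_token = self.fc(cls_token)
        return torch.cat([cls_token, spatial], dim=1)


# Usage: tokens from a 14x14 patch grid plus a class token, dim 192 -> 384.
tokens = torch.randn(2, 1 + 14 * 14, 192)
pooled = TokenPooling(192, 384)(tokens)
print(pooled.shape)  # torch.Size([2, 50, 384]) -> 7x7 grid + class token
```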