Shibao Li, Yixuan Liu, Zhaoyu Wang, Xuerong Cui, Yunwu Zhang, Z. Jiao, Jinze Zhu
The Transformer is one of the mainstream architectures in computer vision. Most Transformer-based designs focus on spatial attention and on reducing the computational cost incurred by high-resolution images, but pay little attention to modeling channel dependencies or to the cost incurred by a large number of channels. In this paper, we propose a new channel-window-based self-attention mechanism and apply two consecutive Transformer layers, with a channel-permuting layer in between, to capture global channel dependencies, which greatly reduces the computational complexity caused by a large number of channels. We also propose a new linear layer for channel attention that eliminates the need for positional bias in the Transformer. The proposed method can be conveniently appended in parallel to existing image classification architectures with minimal modification. We demonstrate its feasibility on state-of-the-art Transformer-based image classification architectures and improve the results on ImageNet-1K. The code will be publicly available on GitHub.
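The mechanism the abstract describes (self-attention restricted to windows along the channel axis, with a channel permutation between two consecutive layers so that information propagates across windows) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function names, the window size, and the single-head, projection-free attention are hypothetical simplifications, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_window_attention(x, window):
    """Self-attention among channels within non-overlapping channel windows.

    x: (L, C) feature map flattened over spatial positions.
    Each channel (a length-L vector) acts as a token; attention is
    restricted to windows of `window` channels, so the score matrices
    are (window x window) instead of (C x C).
    """
    L, C = x.shape
    assert C % window == 0, "C must be divisible by the window size"
    out = np.empty_like(x)
    xt = x.T  # (C, L): channels as tokens
    for s in range(0, C, window):
        w = xt[s:s + window]                    # (window, L)
        attn = softmax(w @ w.T / np.sqrt(L))    # (window, window) scores
        out[:, s:s + window] = (attn @ w).T     # mix channels in the window
    return out

def permute_channels(x, window):
    """Channel shuffle: interleave channel groups so the next layer's
    windows contain channels drawn from different previous windows."""
    L, C = x.shape
    g = C // window
    return x.reshape(L, g, window).transpose(0, 2, 1).reshape(L, C)

# Two consecutive windowed layers with a permutation in between
# approximate global channel mixing at windowed cost.
x = np.random.randn(16, 8)   # 16 spatial positions, 8 channels
y = channel_window_attention(x, window=4)
y = permute_channels(y, window=4)
y = channel_window_attention(y, window=4)
```

After the shuffle, every size-4 window in the second layer sees channels originating from both windows of the first layer, which is how two layers can capture global channel dependencies without a full C-by-C attention matrix.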