BOOK-CHAPTER

DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion

Zhenzhen Chu, Jiayu Chen, Cen Chen, Chengyu Wang, Ziheng Wu, Jun Huang, Weining Qian

Published in: Society for Industrial and Applied Mathematics eBooks   Year: 2024   Pages: 688-696   Publisher: Society for Industrial and Applied Mathematics

Abstract

Self-attention-based vision transformers (ViTs) have emerged as a highly competitive architecture in computer vision. Unlike convolutional neural networks (CNNs), ViTs are capable of global information sharing, and as ViT architectures have matured, they have become increasingly advantageous for many vision tasks. However, the quadratic complexity of self-attention renders ViTs computationally intensive, and their lack of the inductive biases of locality and translation equivariance demands larger model sizes than CNNs to effectively learn visual features. In this paper, we propose a lightweight and efficient vision transformer model called DualToken-ViT that leverages the advantages of both CNNs and ViTs. DualToken-ViT fuses tokens carrying local information, obtained by a convolution-based structure, with tokens carrying global information, obtained by a self-attention-based structure, to achieve an efficient attention structure. In addition, we use position-aware global tokens throughout all stages to enrich the global information, which further strengthens the effect of DualToken-ViT. Because the position-aware global tokens also carry positional information about the image, they make our model better suited to vision tasks. We conducted extensive experiments on image classification, object detection and semantic segmentation tasks to demonstrate the effectiveness of DualToken-ViT. On the ImageNet-1K dataset, our models of different scales achieve accuracies of 75.4% and 79.4% with only 0.5G and 1.0G FLOPs, respectively, and our model with 1.0G FLOPs outperforms LightViT-T, which also uses global tokens, by 0.7%.
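The sketch below illustrates the dual-token idea described in the abstract in minimal PyTorch: a depth-wise convolution branch supplies local tokens, a small set of global tokens aggregates and then redistributes image-wide context through cross-attention, and the two streams are fused. This is an assumption-laden illustration, not the chapter's actual architecture: the module names, the two-step cross-attention layout, and the concat-and-project fusion rule are all our own choices.

import torch
import torch.nn as nn


class DualTokenBlock(nn.Module):
    """Illustrative dual-token fusion block (a sketch, not the
    chapter's design). A depth-wise conv branch produces local
    tokens; a few global tokens gather image-wide context and
    broadcast it back; the two results are concatenated and
    projected."""

    def __init__(self, dim: int, num_global_tokens: int = 8, num_heads: int = 4):
        super().__init__()
        # Local branch: depth-wise conv supplies the locality and
        # translation-equivariance bias that plain ViTs lack.
        self.local = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )
        # Learned parameters stand in for the position-aware global
        # tokens that the paper carries through all stages.
        self.global_tokens = nn.Parameter(torch.zeros(1, num_global_tokens, dim))
        self.aggregate = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.broadcast = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_g = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        self.fuse = nn.Linear(2 * dim, dim)  # assumed fusion: concat + project

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from the current stage.
        B, C, H, W = x.shape
        local = self.local(x)                                  # (B, C, H, W)
        tokens = x.flatten(2).transpose(1, 2)                  # (B, HW, C)
        g = self.global_tokens.expand(B, -1, -1)               # (B, G, C)
        # Global tokens attend to all spatial tokens: cost O(G*HW)
        # instead of the O(HW^2) of full self-attention.
        g, _ = self.aggregate(self.norm_g(g), tokens, tokens)  # (B, G, C)
        # Every spatial token queries the global tokens for context.
        glob, _ = self.broadcast(self.norm_t(tokens), g, g)    # (B, HW, C)
        fused = torch.cat(
            [local.flatten(2).transpose(1, 2), glob], dim=-1)  # (B, HW, 2C)
        fused = self.fuse(fused)                               # (B, HW, C)
        return fused.transpose(1, 2).reshape(B, C, H, W)


# Usage: input and output shapes match an ordinary conv/ViT stage.
block = DualTokenBlock(dim=64)
out = block(torch.randn(2, 64, 14, 14))  # -> torch.Size([2, 64, 14, 14])

The cross-attention detour through a handful of global tokens is what keeps the attention cost linear in the number of spatial tokens, which is consistent with the efficiency claims in the abstract, though the exact mechanism in the chapter may differ.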

Keywords:
Computer vision, Transformer, Artificial intelligence, Computer science, Fusion, Engineering, Electrical engineering

Metrics

Cited By: 2
FWCI (Field Weighted Citation Impact): 5.54
Refs: 0
Citation Normalized Percentile: 0.94 (in the top 10%)

Topics

Robotics and Sensor-Based Localization (Physical Sciences → Engineering → Aerospace Engineering)
CCD and CMOS Imaging Sensors (Physical Sciences → Engineering → Electrical and Electronic Engineering)
Image Processing Techniques and Applications (Physical Sciences → Engineering → Media Technology)

Related Documents

JOURNAL ARTICLE

MDA-ViT: Multimodal image fusion using dual attention vision transformer

Shrida Kalamkar, Geetha Mary Amalanathan

Journal: Multimedia Tools and Applications   Year: 2024   Vol: 84 (21)   Pages: 23701-23723
JOURNAL ARTICLE

Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer

Yifan Xu, Zhijie Zhang, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming Dong, Liqing Zhang, Changsheng Xu, Xing Sun

Journal: Proceedings of the AAAI Conference on Artificial Intelligence   Year: 2022   Vol: 36 (3)   Pages: 2964-2972
JOURNAL ARTICLE

WINter-ViT: Window Interaction Vision Transformer with Head-Aware Attention

Jihyeok Kim, Jaehyeok Kim, So-Yun Park, Jinwoo Yoo

Journal: The Transactions of The Korean Institute of Electrical Engineers   Year: 2025   Vol: 74 (9)   Pages: 1581-1590
JOURNAL ARTICLE

Efficient Vision Transformer via Token Merger

Zhanzhou Feng, Shiliang Zhang

Journal: IEEE Transactions on Image Processing   Year: 2023   Vol: 32   Pages: 4156-4169