JOURNAL ARTICLE

Efficient Vision Transformer via Token Merger

Zhanzhou Feng, Shiliang Zhang

Year: 2023 Journal: IEEE Transactions on Image Processing Vol: 32 Pages: 4156-4169 Publisher: Institute of Electrical and Electronics Engineers

Abstract

Vision Transformers (ViTs) split an image into fixed-size patches and treat them as tokens. This strategy has succeeded in computer vision tasks, but it produces many tokens that are similar in semantics and appearance. This work proposes Token Merger, which identifies redundant tokens and merges them into a compact representation to accelerate ViTs. In each forward pass, Token Merger first identifies meta tokens that represent meaningful cues of the image content, then adaptively merges similar tokens into a single token with reference to the meta tokens. To pursue a reasonable tradeoff between accuracy and efficiency, we further introduce learnable gates that adaptively decide the token merge ratios of different layers. As a generalizable module, Token Merger can be easily plugged into different layers of ViTs to boost their efficiency. Visualizations show that Token Merger progressively merges tokens and finally learns a compact set of tokens with clear semantics. Compared with token pruning methods, Token Merger is more effective at preserving meaningful contextual cues, and thus performs and generalizes substantially better across different vision tasks. Extensive experiments and comparisons with other state-of-the-art downsampling methods also demonstrate its promising performance. For instance, it reduces the number of tokens by 95% and accelerates inference by 62%, while ImageNet classification accuracy drops by only 0.4%. The code will be made available.
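
The merging step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how meta tokens are selected, how similarity is measured, or how a group of tokens is combined, so everything below is an assumption made for illustration: the function names `cosine` and `merge_tokens` are hypothetical, meta tokens are passed in by index rather than learned, similarity is plain cosine similarity, and merging is a simple average. It is not the authors' implementation and omits the learnable gates entirely.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_tokens(tokens, meta_indices):
    """Average every token into the meta token it is most similar to.

    tokens: list of feature vectors (lists of floats).
    meta_indices: positions of the tokens used as meta tokens. This is a
        hypothetical input; the abstract does not state how they are chosen.
    Returns one merged vector per meta token.
    """
    metas = [tokens[i] for i in meta_indices]
    groups = [[] for _ in metas]
    for tok in tokens:
        # Assumption: assign each token to its nearest meta token by cosine similarity.
        best = max(range(len(metas)), key=lambda k: cosine(tok, metas[k]))
        groups[best].append(tok)
    dim = len(tokens[0])
    merged = []
    for k, group in enumerate(groups):
        # Fall back to the meta token itself if nothing was assigned to it.
        members = group or [metas[k]]
        merged.append([sum(t[d] for t in members) / len(members) for d in range(dim)])
    return merged


# Four toy tokens collapse to two merged tokens.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
merged = merge_tokens(tokens, meta_indices=[0, 2])
print(len(tokens), "->", len(merged))
```

In this toy input the first two tokens point in nearly the same direction, as do the last two, so four tokens are reduced to two, each roughly the mean of its group.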

Keywords:
Security token, Computer science, Upsampling, Merge (version control), Inference, Artificial intelligence, Token passing, Transformer, Computer vision, Information retrieval, Image (mathematics), Computer network

Metrics

Cited By: 24
FWCI (Field Weighted Citation Impact): 4.37
Refs: 136
Citation Normalized Percentile: 0.93
Is in top 1%
Is in top 10%

Topics

Advanced Neural Network Applications
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences → Computer Science → Artificial Intelligence
Advanced Image and Video Retrieval Techniques
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Adaptive class token knowledge distillation for efficient vision transformer

Minchan Kang, Sanghyeok Son, Dae‐Shik Kim

Journal: Knowledge-Based Systems Year: 2024 Vol: 304 Pages: 112531-112531

JOURNAL ARTICLE

AToM: Adaptive Token Merging for Efficient Acceleration of Vision Transformer

Jaekang Shin, Myeonggu Kang, Yunki Han, Junyoung Park, Lee‐Sup Kim

Journal: IEEE Transactions on Computers Year: 2025 Vol: 74 (5) Pages: 1620-1633

JOURNAL ARTICLE

An Energy-Efficient High Resolution Vision Transformer Processor Exploiting Token Similarity Beyond Token Merging

Jungjun Oh, Sangjin Kim, Jiwon Choi, Junha Ryu, Byeongcheol Kim, Yuseon Choi, Hoi-Jun Yoo

Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems Year: 2025 Vol: 34 (1) Pages: 118-129