Efficient Vision Transformer via Token Merger

Zhanzhou Feng; Shiliang Zhang

doi:10.1109/tip.2023.3293763

JOURNAL ARTICLE

Year: 2023
Journal: IEEE Transactions on Image Processing
Vol: 32
Pages: 4156-4169
Publisher: Institute of Electrical and Electronics Engineers

Abstract

Vision Transformers (ViTs) split an image into fixed-size patches and treat them as tokens. This strategy has succeeded in computer vision tasks, but it introduces many tokens that are similar in semantics and appearance. This work proposes Token Merger, which spots redundant tokens and merges them into a compact representation to accelerate ViTs. For each forward inference, Token Merger first identifies meta tokens that represent meaningful cues of the image content, then adaptively merges similar tokens into a single token with reference to the meta tokens. To pursue a reasonable tradeoff between accuracy and efficiency, we further introduce learnable gates that adaptively decide the token merge ratios of different layers. As a generalizable module, Token Merger can be easily plugged into different layers of ViTs to boost their efficiency. Visualizations show that Token Merger progressively merges tokens and finally learns a compact set of tokens representing clear semantics. Compared with token pruning methods, Token Merger is more effective in preserving meaningful contextual cues, and thus performs and generalizes substantially better in different vision tasks. Extensive experiments and comparisons with other state-of-the-art downsampling methods also demonstrate its promising performance. For instance, it reduces the number of tokens by 95% and accelerates inference by 62%, while the ImageNet classification accuracy drops by only 0.4%. The code will be available.
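
The two steps described in the abstract (pick meta tokens, then merge each token into its most similar meta token) can be illustrated with a small NumPy sketch. This is not the authors' implementation, whose details the abstract does not give. The function name `merge_tokens`, the greedy farthest-point rule for choosing meta tokens, the cosine-similarity assignment, and the plain averaging are all assumptions made for illustration, and a fixed `num_meta` stands in for the merge ratio that the paper learns with gates.

```python
import numpy as np


def merge_tokens(tokens, num_meta: int) -> tuple[np.ndarray, np.ndarray]:
    """Merge N tokens of width D into num_meta tokens (illustrative sketch only).

    Returns the merged tokens, shape (num_meta, D), and the group index
    assigned to each input token, shape (N,).
    """
    tokens = np.asarray(tokens, dtype=np.float64)
    n, d = tokens.shape
    if not 1 <= num_meta <= n:
        raise ValueError("num_meta must be between 1 and the number of tokens")

    # Unit-normalize so that dot products are cosine similarities.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.maximum(norms, 1e-12)

    # Assumed meta-token selection: start from the largest-norm token, then
    # repeatedly add the token least similar to every meta token chosen so far.
    first = int(np.argmax(norms[:, 0]))
    meta_idx = [first]
    max_sim = unit @ unit[first]
    max_sim[first] = np.inf  # never pick the same token twice
    for _ in range(num_meta - 1):
        nxt = int(np.argmin(max_sim))
        meta_idx.append(nxt)
        max_sim = np.maximum(max_sim, unit @ unit[nxt])
        max_sim[nxt] = np.inf
    meta_idx = np.array(meta_idx)

    # Assign every token to its most similar meta token.
    sim = unit @ unit[meta_idx].T  # shape (N, num_meta)
    assign = np.argmax(sim, axis=1)
    assign[meta_idx] = np.arange(num_meta)  # each meta token keeps its own group

    # Merge each group by averaging its members.
    merged = np.zeros((num_meta, d))
    np.add.at(merged, assign, tokens)
    counts = np.bincount(assign, minlength=num_meta)
    merged /= counts[:, None]
    return merged, assign


if __name__ == "__main__":
    demo = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    merged, assign = merge_tokens(demo, num_meta=2)
    print(assign)
    print(merged)
```

Unlike pruning, which would discard the two near-duplicate tokens in the demo, averaging keeps their contribution in the merged representation, which is the contrast the abstract draws.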

Keywords:

Security token; Computer science; Upsampling; Merge (version control); Inference; Artificial intelligence; Token passing; Transformer; Computer vision; Information retrieval; Image (mathematics); Computer network

Metrics

Cited By

FWCI (Field Weighted Citation Impact): 4.37

Refs: 136

Citation Normalized Percentile: 0.93

Is in top 1%

Is in top 10%

Topics

Advanced Neural Network Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling

Adaptive class token knowledge distillation for efficient vision transformer

AToM: Adaptive Token Merging for Efficient Acceleration of Vision Transformer

An Energy-Efficient High Resolution Vision Transformer Processor Exploiting Token Similarity Beyond Token Merging

DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion