JOURNAL ARTICLE

Object Detection via Multi-Scale Token Based on Vision Transformer

Abstract

Vision transformers have achieved impressive performance on object detection. Existing transformer-based detectors model multi-scale features only across tokens and pay little attention to the fine-grained features inside a single token, which can lead to a loss of semantic information in the object detection task. To address this issue, we propose a novel network consisting of three components. (1) The Internal Multiscale Token Module (IMTM) focuses on the receptive field size of each token and transforms the token dimension size to extract more multiscale features within the self-attention layer, thereby improving the performance and generalization ability of the model. (2) The Differential Filter Module (DFM) uses a convolutional network to focus on high-frequency information in the image, helping the Transformer learn edge features and establish local context, while improving model performance through residual connections. (3) The Feature Fusion Module (FFM) enhances the local and global information extracted by the network by fusing information from different dimensions. Extensive experiments on PASCAL VOC show that the proposed method achieves state-of-the-art performance on object detection.
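
The abstract does not give the DFM's layers, so the following is only a rough sketch of the general idea it describes: extract high-frequency (edge) information with a convolution and add it back to the input through a residual connection. A fixed 3x3 Laplacian kernel is assumed here as a stand-in for the learned convolutional network, and the names `high_pass` and `differential_filter` are hypothetical, not taken from the paper.

```python
# Hypothetical illustration only: a fixed Laplacian kernel stands in for the
# learned convolutional layers that the abstract attributes to the DFM.

LAPLACIAN = [
    [0, -1, 0],
    [-1, 4, -1],
    [0, -1, 0],
]


def high_pass(image):
    """Apply the 3x3 Laplacian to a 2D grid, replicating border pixels."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            total = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    # Clamp indices so that border pixels are replicated.
                    rr = min(max(r + dr, 0), height - 1)
                    cc = min(max(c + dc, 0), width - 1)
                    total += LAPLACIAN[dr + 1][dc + 1] * image[rr][cc]
            out[r][c] = total
    return out


def differential_filter(image):
    """Residual connection: input plus its high-frequency response."""
    edges = high_pass(image)
    return [
        [pixel + edge for pixel, edge in zip(row, edge_row)]
        for row, edge_row in zip(image, edges)
    ]


# A vertical step edge: the response is zero in flat regions and
# non-zero only next to the edge.
step = [[0.0, 0.0, 1.0, 1.0] for _ in range(3)]
step_edges = high_pass(step)
step_enhanced = differential_filter(step)
```

On a constant image the high-pass response is zero everywhere, so the residual output equals the input; only pixels adjacent to an intensity change are altered, which is the sense in which such a filter emphasizes edges.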

Keywords:
Computer science; Security token; Computer vision; Artificial intelligence; Object detection; Transformer; Pattern recognition (psychology); Engineering; Electrical engineering; Computer network; Voltage

Metrics

Cited By: 1
FWCI (Field Weighted Citation Impact): 0.18
Refs: 28
Citation Normalized Percentile: 0.47

Topics

Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Image and Object Detection Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Neural Network Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition