JOURNAL ARTICLE

Cross-Layer Feature Fusion Vision Transformer for Fine-Grained Visual Classification

Abstract

Fine-grained visual classification (FGVC) is an important task in computer vision, and one of its main challenges is the combination of large intra-class differences and small inter-class differences. Traditional methods rely on manual feature extraction and handle small inter-class differences poorly, which makes them unsuitable for large-scale, high-dimensional image data. In recent years, the Vision Transformer (ViT) has achieved good results in general image recognition tasks, and its self-attention mechanism makes it well suited to FGVC: by capturing fine-grained features in the image, self-attention avoids the need for manual feature extraction. To effectively mine cross-layer features, we propose the Cross-Layer Fine-grained Feature (CLFF) module and introduce a data augmentation method based on attention cropping and erasure to improve fine-grained classification performance. We conducted experiments on the NABirds, CUB-200-2011, and Stanford Dogs datasets, and the results show that our method outperforms current state-of-the-art methods in terms of accuracy.
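
The abstract names attention cropping and erasure but does not specify how they are computed. As a rough illustration only, the sketch below shows one common way such augmentations can be derived from an attention map: threshold the map relative to its peak, crop to the bounding box of the high-attention cells, and, separately, blank those cells out. The threshold rule, the `ratio` parameter, the toy 4x4 grid, and all function names are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Tuple

Grid = List[List[float]]
Box = Tuple[int, int, int, int]


def attention_mask(attn: Grid, ratio: float) -> List[List[bool]]:
    """Mark cells whose attention is at least `ratio` times the map's maximum."""
    peak = max(max(row) for row in attn)
    return [[value >= ratio * peak for value in row] for row in attn]


def attention_crop_box(attn: Grid, ratio: float = 0.5) -> Box:
    """Return (top, left, bottom, right), inclusive, bounding the high-attention cells."""
    mask = attention_mask(attn, ratio)
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]


def crop(image: Grid, box: Box) -> Grid:
    """Cut the inclusive box out of the image."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


def attention_erase(image: Grid, attn: Grid, ratio: float = 0.5, fill: float = 0.0) -> Grid:
    """Return a copy of `image` with the high-attention cells replaced by `fill`."""
    mask = attention_mask(attn, ratio)
    return [
        [fill if mask[r][c] else image[r][c] for c in range(len(image[0]))]
        for r in range(len(image))
    ]


# Toy 4x4 "image" and a hypothetical attention map on the same grid.
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
attn = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.8, 0.0],
    [0.0, 0.7, 1.0, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]

box = attention_crop_box(attn)         # region the model attends to most
cropped = crop(image, box)             # zoomed-in view of that region
erased = attention_erase(image, attn)  # same image with that region blanked
```

In this reading, the cropped view pushes a classifier to look more closely at the most discriminative region, while the erased view pushes it to find evidence elsewhere. A real pipeline would operate on image tensors and resize the crop back to the input resolution.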

Keywords:
Computer science; Artificial intelligence; Fusion; Computer vision; Transformer; Pattern recognition (psychology); Feature (linguistics); Engineering; Voltage; Electrical engineering

Metrics

Cited By: 2
FWCI (Field Weighted Citation Impact): 0.36
Refs: 24
Citation Normalized Percentile: 0.55

Topics

Advanced Image and Video Retrieval Techniques
  Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Advanced Neural Network Applications
  Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Industrial Vision Systems and Defect Detection
  Physical Sciences → Engineering → Industrial and Manufacturing Engineering