Fine-grained visual classification (FGVC) is an important task in computer vision, and its main challenges are large intra-class differences and small inter-class differences. Traditional methods rely on manual feature extraction and handle small inter-class differences poorly, which makes them unsuitable for large-scale, high-dimensional image data. In recent years, the Vision Transformer (ViT) has achieved good results in general image recognition tasks, and its self-attention mechanism makes it suitable for FGVC: self-attention captures fine features in the image effectively, which avoids the need for manual feature extraction. To mine cross-layer features effectively, we propose the Cross-Layer Fine-grained Feature (CLFF) module and introduce a data augmentation method based on attention cropping and erasure to improve fine-grained classification performance. We conducted experiments on the NABirds, CUB-200-2011, and Stanford Dogs datasets, and the results show that our method outperforms current state-of-the-art methods in terms of accuracy.
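The abstract does not spell out how attention cropping and erasure are computed, so the following is only a minimal sketch of one common formulation, not the authors' implementation. It assumes the attention map has already been resized to the image resolution, that both are plain nested lists, and that a cell counts as "attended" when its value is at least `ratio` times the map's peak. The function names and the `ratio` and `fill` parameters are hypothetical.

```python
def attended_mask(attn, ratio=0.5):
    """Boolean mask of cells whose attention is >= ratio * peak (assumed rule)."""
    peak = max(max(row) for row in attn)
    thresh = ratio * peak
    return [[value >= thresh for value in row] for row in attn]


def attention_crop(image, attn, ratio=0.5):
    """Crop the image to the bounding box of the attended region."""
    mask = attended_mask(attn, ratio)
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, hit in enumerate(row) if hit]
    top, bottom = min(rows), max(rows) + 1   # bottom is exclusive
    left, right = min(cols), max(cols) + 1   # right is exclusive
    return [row[left:right] for row in image[top:bottom]]


def attention_erase(image, attn, ratio=0.5, fill=0):
    """Replace attended pixels with `fill`, leaving the rest untouched."""
    mask = attended_mask(attn, ratio)
    return [
        [fill if hit else pixel for pixel, hit in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# Toy 4x4 example: attention peaks on the central 2x2 block.
attn = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.8, 0.0],
    [0.0, 0.7, 1.0, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
cropped = attention_crop(image, attn)   # [[6, 7], [10, 11]]
erased = attention_erase(image, attn)   # central 2x2 block set to 0
```

In a training pipeline the cropped region would typically be resized back to the network's input size, so the model sees the discriminative part at higher resolution, while the erased image pushes it to find evidence outside the most attended region.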
Jun Wang, Xiaoming Yu, Yongsheng Gao
Min Huang, Saixing Zhu, Zehua Wang, Shuanghong Qu
Zhang Kaiyue, Yongjiang Xue, Ling Du, Qingzeng Song
Guanglei Sheng, Gang Hu, Xiaofeng Wang, Wei Chen, Jinling Jiang, Quanquan Xiao
Chin‐Feng Lai, Yi-Wei Lai, Shih-Yeh Chen, Chi-Hsuan Lee, Mu‐Yen Chen