RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition

Yunqing Hu; Xuan Jin; Yin Zhang⋆; Haiwen Hong; Jingfeng Zhang; Yuan He; Hui Xue

doi:10.1145/3474085.3475561

JOURNAL ARTICLE

Year: 2021 Pages: 4239-4248

Abstract

In fine-grained image recognition (FGIR), the localization and amplification of region attention is an important factor, which has been explored extensively in approaches based on convolutional neural networks (CNNs). The recently developed vision transformer (ViT) has achieved promising results in computer vision tasks. Compared with CNNs, its image sequentialization is a brand-new approach. However, because its patches have a fixed size, ViT is limited in its receptive field size and thus lacks the local attention of CNNs, and it cannot generate multi-scale features to learn discriminative region attention. To facilitate the learning of discriminative region attention without box/part annotations, we use the strength of the attention weights to measure the importance of the patch tokens corresponding to the raw images. We propose the recurrent attention multi-scale transformer (RAMS-Trans), which uses the transformer's self-attention to recursively learn discriminative region attention in a multi-scale manner. Specifically, at the core of our approach lies the dynamic patch proposal module (DPPM), which guides region amplification to complete the integration of multi-scale image patches. The DPPM starts with the full-size image patches and iteratively scales up the region attention to generate new patches from global to local, using the intensity of the attention weights generated at each scale as an indicator. Our approach requires only the attention weights that come with ViT itself and can be easily trained end-to-end. Extensive experiments demonstrate that RAMS-Trans performs better than existing works, as well as efficient CNN models, achieving state-of-the-art results on three benchmark datasets.
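
The attention-guided region proposal described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of a single layer's class-token attention averaged over heads, the mean-based threshold, and the nearest-neighbour upsampling are all assumptions chosen for illustration, since the abstract does not specify these details.

```python
import numpy as np


def patch_importance(attn, grid):
    """Score each patch by the attention the class token pays to it.

    attn: array of shape (heads, tokens, tokens); token 0 is assumed to be
    the class token, followed by grid * grid patch tokens (an assumption).
    """
    cls_to_patches = attn[:, 0, 1:].mean(axis=0)  # average over heads
    return cls_to_patches.reshape(grid, grid)


def propose_region(importance, patch_size, scale=1.0):
    """Bounding box (y0, y1, x0, x1) in pixels around high-attention patches.

    The threshold scale * mean is a hypothetical choice for this sketch.
    """
    mask = importance > scale * importance.mean()
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    if rows.size == 0:  # no patch stands out: fall back to the full image
        side = importance.shape[0] * patch_size
        return 0, side, 0, side
    return (rows[0] * patch_size, (rows[-1] + 1) * patch_size,
            cols[0] * patch_size, (cols[-1] + 1) * patch_size)


def amplify(image, box, out_size):
    """Crop the box and upsample it to out_size x out_size (nearest neighbour)."""
    y0, y1, x0, x1 = box
    crop = image[y0:y1, x0:x1]
    ys = np.arange(out_size) * crop.shape[0] // out_size
    xs = np.arange(out_size) * crop.shape[1] // out_size
    return crop[ys][:, xs]
```

In a recurrent, multi-scale setup along the lines the abstract describes, the amplified crop would be split into patches again and passed back through the transformer, so that each pass attends to a more local region than the one before.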

Keywords:

Discriminative model; Computer science; Artificial intelligence; Transformer; Pattern recognition (psychology); Convolutional neural network; Computer vision; Engineering

Metrics

Cited By: 123

FWCI (Field Weighted Citation Impact): 9.00

Citation Normalized Percentile: 0.98

Topics

Advanced Neural Network Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Generative Adversarial Networks and Image Synthesis

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Group-Attention Transformer for Fine-Grained Image Recognition

Multi-scale Transformer with External Attention for fine-grained image classification

Multi-Scale CNN for Fine-Grained Image Recognition

Multi-Attention Multi-Class Constraint for Fine-grained Image Recognition

Fine‑Grained Image Recognition Method Based on Attention and Multi‑scale Ensemble Learning