JOURNAL ARTICLE

Fine-Grained Visual Classification via Adaptive Attention Quantization Transformer

Shishi Qiao, S. H. Li, Haiyong Zheng

Year: 2025  Journal: IEEE Transactions on Neural Networks and Learning Systems  Vol: PP  Pages: 1-15  Publisher: Institute of Electrical and Electronics Engineers

Abstract

Vision transformers (ViTs) have recently demonstrated remarkable performance in fine-grained visual classification (FGVC). However, most existing ViT-based methods overlook the varied focus of different attention heads: heads that attend to nondiscriminative regions dilute the discriminative signal crucial for FGVC. To address this issue, we propose a novel adaptive attention quantization transformer (A2QTrans) for FGVC that selects key discriminative features by analyzing the heads' attention. A2QTrans comprises three key modules: the adaptive quantization selection (AQS) module, the background elimination (BE) module, and the dynamic hybrid optimization (DHO) module. Specifically, the AQS module dynamically selects the most discriminative features in a data-driven manner by quantizing the attention scores across multiple attention heads with a global, learnable threshold. This process effectively filters out irrelevant information from nondiscriminative tokens, concentrating attention on important regions. To address the nondifferentiability inherent in updating this threshold during binarization, the AQS module employs a straight-through estimator (STE) for discrete optimization, enabling end-to-end gradient backpropagation. In addition, we exploit the prior that background regions usually contain little meaningful information, and design the BE module to further calibrate the attention heads' focus onto the main objects in images. Finally, the DHO module adaptively optimizes and integrates the attentive results of the AQS and BE modules to achieve optimal classification performance. Extensive experiments on four challenging FGVC benchmark datasets and three ViT variants demonstrate A2QTrans's superior performance, achieving state-of-the-art (SOTA) results. The source code is available at https://github.com/Lishixian0817/A2QTrans.
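The threshold-based token selection with a straight-through estimator described in the abstract can be sketched in plain Python. Note this is a minimal illustrative sketch, not the authors' implementation: the function names and the toy gradient routine are assumptions, and a real ViT would apply this per attention head inside an autograd framework.

```python
def ste_binarize(scores, threshold):
    """Forward pass: hard 0/1 mask over attention scores.

    Tokens whose attention score clears the learnable global threshold
    are kept; the rest are quantized away as nondiscriminative.
    """
    return [1.0 if s >= threshold else 0.0 for s in scores]

def ste_backward(upstream_grads):
    """Backward pass under the straight-through estimator (STE).

    The hard step function has zero gradient almost everywhere, so the
    STE treats it as the identity: upstream gradients flow to the scores
    unchanged, and the threshold receives the negated sum (raising the
    threshold turns tokens off, hence the sign flip). This is what lets
    a binarizing selection step remain trainable end to end.
    """
    grad_scores = list(upstream_grads)
    grad_threshold = -sum(upstream_grads)
    return grad_scores, grad_threshold

# Toy example: four tokens' head-averaged attention scores, threshold 0.5.
mask = ste_binarize([0.1, 0.6, 0.3, 0.9], 0.5)   # -> [0.0, 1.0, 0.0, 1.0]
grad_scores, grad_threshold = ste_backward([1.0, 1.0, 1.0, 1.0])
```

In the forward pass only the two tokens above the threshold survive; in the backward pass the STE supplies a surrogate gradient so both the scores and the threshold itself can be updated by ordinary backpropagation.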


Related Documents

CONFERENCE PAPER

A Transformer Architecture with Adaptive Attention for Fine-Grained Visual Classification

Changli Cai, Tiankui Zhang, Zhewei Weng, Chunyan Feng, Yapeng Wang

Conference: 2021 7th International Conference on Computer and Communications (ICCC)  Year: 2021  Pages: 863-867
JOURNAL ARTICLE

Hierarchical attention vision transformer for fine-grained visual classification

Xiaobin Hu, Shining Zhu, Taile Peng

Journal: Journal of Visual Communication and Image Representation  Year: 2023  Vol: 91  Pages: 103755
JOURNAL ARTICLE

Dual-Dependency Attention Transformer for Fine-Grained Visual Classification

Shiyan Cui, Bin Hui

Journal: Sensors  Year: 2024  Vol: 24 (7)  Pages: 2337