JOURNAL ARTICLE

ViTOL: Vision Transformer for Weakly Supervised Object Localization

Saurav Gupta, Sourav Lakhotia, Abhay Rawat, Rahul Tallamraju

Year: 2022 Journal: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) Pages: 4100-4109

Abstract

Weakly supervised object localization (WSOL) aims to predict object locations in an image using only image-level category labels. Image classification models face three common challenges when localizing objects: (a) they tend to focus on the most discriminative features in an image, which confines the localization map to a very small region; (b) the localization maps are class-agnostic, so the models highlight objects of multiple classes in the same image; and (c) localization performance is degraded by background noise. To alleviate these challenges, we introduce the following simple changes through our proposed method, ViTOL. We leverage a vision transformer for its self-attention, and introduce a patch-based attention dropout layer (p-ADL) to increase the coverage of the localization map and a gradient attention rollout mechanism to generate class-dependent attention maps. We conduct extensive quantitative, qualitative, and ablation experiments on the ImageNet-1K and CUB datasets. We achieve state-of-the-art MaxBoxAcc-V2 localization scores of 70.47% and 73.17% on the two datasets, respectively. Code is available at https://github.com/Saurav-31/ViTOL.
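
The abstract names a gradient attention rollout mechanism but does not spell out its equations. The sketch below is only one plausible reading, not the authors' implementation: it applies standard attention rollout (head-averaged attention plus an identity term for the residual connection, row-normalized and multiplied across layers) and, as an assumption, weights each attention map by the positive part of its product with the class-score gradient. The function names, the array layout, and the convention that token 0 is the CLS token are all hypothetical choices made for illustration.

```python
import numpy as np


def attention_rollout(attentions, gradients=None):
    """Roll attention out across layers.

    attentions: list of arrays shaped (heads, tokens, tokens), one per layer.
    gradients:  optional list of arrays with the same shapes, holding the
                gradient of a class score w.r.t. each attention map
                (assumed to be collected elsewhere, e.g. via framework hooks).
    """
    num_tokens = attentions[0].shape[-1]
    identity = np.eye(num_tokens)
    rollout = identity.copy()
    for layer, attn in enumerate(attentions):
        if gradients is not None:
            # Assumption: class-specific weighting keeps only the positive
            # part of attention * gradient.
            attn = np.clip(attn * gradients[layer], 0.0, None)
        fused = attn.mean(axis=0)        # average over heads
        fused = fused + identity         # account for the residual connection
        fused = fused / fused.sum(axis=-1, keepdims=True)  # rows sum to 1
        rollout = fused @ rollout        # compose with earlier layers
    return rollout


def cls_to_patch_map(rollout, grid_size):
    """Turn the CLS row of the rollout into a [0, 1] patch-grid map."""
    patch_scores = rollout[0, 1:]        # CLS row, without the CLS column
    grid = patch_scores.reshape(grid_size, grid_size)
    return (grid - grid.min()) / (grid.max() - grid.min() + 1e-8)


# Toy usage with random inputs: 3 layers, 2 heads, a 4x4 patch grid + CLS.
rng = np.random.default_rng(0)
grid_size = 4
tokens = grid_size * grid_size + 1
attn_maps = []
for _ in range(3):
    raw = rng.random((2, tokens, tokens))
    attn_maps.append(raw / raw.sum(axis=-1, keepdims=True))
grad_maps = [rng.standard_normal((2, tokens, tokens)) for _ in range(3)]

class_map = cls_to_patch_map(attention_rollout(attn_maps, grad_maps), grid_size)
```

Calling `attention_rollout` without gradients gives the class-agnostic rollout, which illustrates challenge (b) above: without a class-specific signal, the same map is produced whichever class is queried.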

Keywords:
Artificial intelligence; Computer science; Discriminative model; Leverage (statistics); Pattern recognition (psychology); Computer vision; Contextual image classification; Image (mathematics)

Metrics

Cited By: 27
FWCI (Field Weighted Citation Impact): 1.86
Refs: 45
Citation Normalized Percentile: 0.89

Topics

Advanced Neural Network Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence
Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

CLIP-Driven Transformer for Weakly Supervised Object Localization

Zhiwei Chen, Yunhang Shen, Liujuan Cao, Shengchuan Zhang, Rongrong Ji

Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence Year: 2025 Vol: 47 (6) Pages: 4878-4896

JOURNAL ARTICLE

Task-Aware Weakly Supervised Object Localization With Transformer

Meng Meng, Tianzhu Zhang, Zhe Zhang, Yongdong Zhang, Feng Wu

Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence Year: 2022 Vol: 45 (7) Pages: 1-13

JOURNAL ARTICLE

Reperceive Global Vision of Transformer for Remote Sensing Images Weakly Supervised Object Localization

Xuran Hu, Mingzhe Zhu, Zhengpeng Feng, Ljubiša Stanković

Journal: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing Year: 2024 Vol: 17 Pages: 16902-16916