JOURNAL ARTICLE

Contrastive-Aware ViT for Weakly Supervised Semantic Segmentation

Abstract

Class activation maps (CAMs) are crucial in weakly supervised semantic segmentation (WSSS) tasks. However, challenges arise when the initial CAM quality is limited, leading to diminished performance during the refinement and post-processing stages. While Vision Transformers (ViTs) enhance initial CAMs using self-attention mechanisms and class tokens, they fail to leverage additional class-wise and patch-wise information. In this paper, we propose a contrastive learning approach to effectively utilize this information and generate superior initial CAMs. Our Contrastive-Aware ViT framework encompasses Patch-to-Patch (PtP) intra-image contrast, which aligns patch representations within an image, and Inter-Class Image (IIC) contrast, which aligns class-wise predictions across a batch of images. Evaluated on the PASCAL VOC 2012 dataset, our method achieves notable improvements of 1.4% and 1.6% over the MCTformer baseline on the train and val sets, respectively. Ablation studies on PtP and IIC further demonstrate the superiority of our method across multiple diverse object cases, highlighting its effectiveness in WSSS tasks.
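The abstract does not give the exact loss formulation, but contrastive objectives of this kind are commonly instances of the InfoNCE loss: an anchor embedding is pulled toward a positive (e.g. a patch of the same pseudo-class, for PtP) and pushed away from negatives. The sketch below is a minimal, hypothetical illustration of that general pattern in plain Python — the function names, the temperature value, and the choice of cosine similarity are assumptions, not details from the paper.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, tau=0.1):
    # InfoNCE loss: -log( exp(sim(a,p)/tau) / sum over {p} + negatives ).
    # Low when the anchor is closer to the positive than to the negatives.
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy 2-D "patch embeddings": a well-aligned positive yields a small loss,
# a misaligned one a large loss.
loss_good = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
loss_bad = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In a PtP-style setting the anchors, positives, and negatives would be ViT patch embeddings grouped by pseudo-labels within one image; an IIC-style variant would instead contrast class-wise representations across a batch.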

Keywords:
Weakly supervised semantic segmentation, Vision Transformer, Contrastive learning, Class activation maps, PASCAL VOC

Metrics

Cited by: 0
FWCI (Field-Weighted Citation Impact): 0.00
References: 19
Citation Normalized Percentile: 0.20

Topics

Advanced Neural Network Applications (Computer Science: Computer Vision and Pattern Recognition)
Domain Adaptation and Few-Shot Learning (Computer Science: Artificial Intelligence)
COVID-19 diagnosis using AI (Medicine: Radiology, Nuclear Medicine and Imaging)