Class activation maps (CAMs) are crucial in weakly-supervised semantic segmentation (WSSS) tasks. However, low-quality initial CAMs degrade performance in the subsequent refinement and post-processing stages. While Vision Transformers (ViTs) improve initial CAMs through self-attention mechanisms and class tokens, they fail to leverage additional class-wise and patch-wise information. In this paper, we propose a contrastive learning approach that exploits this information to generate superior initial CAMs. Our Contrastive-Aware ViT framework comprises Patch-to-Patch (PtP) intra-image contrast, which aligns patch representations within an image, and Inter-Class Image (IIC) contrast, which aligns class-wise predictions across a batch of images. Evaluated on the PASCAL VOC 2012 dataset, our method improves over the MCTformer baseline by 1.4% and 1.6% on the train and val sets, respectively. Ablation studies on PtP and IIC further demonstrate the superiority of our method across diverse object categories, highlighting its effectiveness in WSSS tasks.
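The abstract names the two contrast terms but not their exact formulation. The sketch below is therefore only illustrative: it casts both PtP and IIC as supervised InfoNCE-style losses, and the patch pseudo-label assignment (argmax of the initial CAM), the tensor shapes, the temperature `tau`, and the function names `ptp_contrast` and `iic_contrast` are our assumptions rather than the paper's definitions.

```python
import torch
import torch.nn.functional as F


def ptp_contrast(patch_feats, patch_labels, tau=0.1):
    """Patch-to-Patch (PtP) intra-image contrast (illustrative sketch).

    patch_feats:  (N, D) patch embeddings from one image.
    patch_labels: (N,) per-patch pseudo class assignments, e.g. the argmax
                  of the initial CAM (an assumption; the abstract does not
                  specify the assignment rule).
    Pulls together patches assigned to the same class and pushes apart
    patches assigned to different classes, in supervised-InfoNCE form.
    """
    z = F.normalize(patch_feats, dim=1)
    sim = z @ z.t() / tau                                 # (N, N) scaled cosine similarities
    same = patch_labels.unsqueeze(0) == patch_labels.unsqueeze(1)
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = same & ~eye                                     # positives: same class, not self
    logits = sim.masked_fill(eye, float('-inf'))          # exclude self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    valid = pos.any(1)
    if not valid.any():                                   # no positive pairs in this input
        return z.new_zeros(())
    denom = pos.sum(1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos, 0.0).sum(1) / denom)
    return loss[valid].mean()


def iic_contrast(class_feats, image_labels, tau=0.1):
    """Inter-Class Image (IIC) contrast (illustrative sketch).

    class_feats:  (B, C, D) per-class token embeddings for a batch
                  (C class tokens per image, as in MCTformer).
    image_labels: (B, C) multi-hot image-level labels.
    Treats class-c embeddings from images that both contain class c as
    positives and embeddings of different classes as negatives.
    """
    B, C, D = class_feats.shape
    present = image_labels.bool().reshape(-1)             # keep tokens of present classes only
    z = class_feats.reshape(-1, D)[present]
    cls = torch.arange(C, device=z.device).repeat(B)[present]
    return ptp_contrast(z, cls, tau)                      # same InfoNCE form at class level
```

A hypothetical usage, with 196 patches, 20 VOC foreground classes, and a batch of 8, would sum the two terms into one auxiliary training loss:

```python
patch_feats = torch.randn(196, 256)
patch_labels = torch.randint(0, 20, (196,))
class_feats = torch.randn(8, 20, 256)
image_labels = torch.randint(0, 2, (8, 20))
loss = ptp_contrast(patch_feats, patch_labels) + iic_contrast(class_feats, image_labels)
```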