Keyang Cheng, Jingfeng Tang, Hongjian Gu, Hao Wan, Maozhen Li
Most existing Vision Transformer-based frameworks for weakly supervised semantic segmentation rely on class activation maps to generate pseudo masks. Although this approach mitigates the class-agnostic issue, it still suffers from misclassification and noisy segmentation results. To overcome these limitations, we propose an attention-based framework named Cross-block Sparse Class Token Contrast (CB-SCTC), which incorporates a Dynamic Sparse Attention (DSA) module and a Cross-block Class Token Contrast (CB-CTC) scheme. Specifically, the Cross-block Class Token Contrast scheme enforces diversity among the final class tokens by learning from the lower similarity of the class tokens in relatively shallower blocks. Moreover, the Dynamic Sparse Attention module post-processes the output of the softmax function in the attention mechanism to reduce noise. Extensive experiments show that the proposed framework is a valid alternative to class activation maps: it achieves competitive mIoU scores on PASCAL VOC 2012 (val: 75.5%, test: 75.2%) and MS COCO 2014 (val: 46.9%). Our code is available at https://github.com/Jingfeng-Tang/CB-SCTC.
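The abstract describes the DSA module only at a high level: sparsifying the softmax output of the attention mechanism to suppress noisy low-weight entries. A minimal sketch of one common way to do this is shown below, assuming a top-k retention rule with renormalization; the function name `sparse_attention` and the `keep_ratio` parameter are illustrative assumptions, not the paper's actual dynamic criterion (see the released code for the authors' implementation).

```python
import numpy as np

def sparse_attention(scores, keep_ratio=0.5):
    """Illustrative sketch (not the paper's exact method): apply softmax
    to raw attention scores, zero out the smallest weights per query,
    and renormalize so noisy low-weight keys stop spreading activation.
    `keep_ratio` is an assumed hyperparameter: the fraction of keys
    retained for each query."""
    # standard numerically stable softmax over the key axis
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = e / e.sum(axis=-1, keepdims=True)
    # keep only the top-k attention weights per query row
    k = max(1, int(round(keep_ratio * attn.shape[-1])))
    thresh = np.sort(attn, axis=-1)[..., -k][..., None]
    sparse = np.where(attn >= thresh, attn, 0.0)
    # renormalize so each row of weights still sums to 1
    return sparse / sparse.sum(axis=-1, keepdims=True)
```

With `keep_ratio=0.5` on a row of four scores, the two smallest attention weights are zeroed and the remaining two are rescaled to sum to one, so only the strongest query-key interactions contribute to the output.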