Narges Javid Tajrishi, Sepehr Amini Afshar, Shohreh Kasaei
Weakly-Supervised Semantic Segmentation (WSSS) with image-level labels commonly uses Class Activation Maps (CAM) to generate pseudo-labels. However, Convolutional Neural Networks (CNNs), with their limited local receptive fields, often struggle to identify entire object regions. Recently, the Vision Transformer (ViT) architecture has been employed instead of CNNs to capture long-range feature dependencies via the self-attention mechanism. Despite its advantages, ViT tends to overlook local feature details, leading to low-quality attention maps with unclear object boundaries. This paper introduces a novel method to enhance the local details in attention maps by leveraging local patches. These local patches are selected from regions that are more likely to contain the desired objects. By effectively utilizing these local patches during both the training and generation stages, the model yields more detailed attention maps. Extensive experiments on the PASCAL VOC 2012 benchmark dataset demonstrate the efficacy of the proposed approach. The results show significant improvements (+2.6% mIoU) with minimal computational overhead, underscoring the potential of the proposed method in the field of Weakly-Supervised Semantic Segmentation.
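The abstract does not give implementation details for how patches are selected, but the core idea of picking local regions with high object likelihood can be sketched as follows. This is a minimal illustration, assuming the attention/CAM scores are available as a 2D array and patches are taken from a fixed non-overlapping grid; the function name and parameters are hypothetical, not the authors' actual method.

```python
import numpy as np

def select_local_patches(attn_map, patch_size, top_k):
    """Return the top_k patch locations (row, col of top-left corner)
    whose summed attention score is highest.

    attn_map: 2D array (H, W) of attention/CAM scores.
    Patches lie on a non-overlapping grid with stride = patch_size
    (a simplifying assumption for this sketch).
    """
    H, W = attn_map.shape
    scored = []
    for r in range(0, H - patch_size + 1, patch_size):
        for c in range(0, W - patch_size + 1, patch_size):
            score = attn_map[r:r + patch_size, c:c + patch_size].sum()
            scored.append((score, r, c))
    scored.sort(reverse=True)           # highest-scoring patches first
    return [(r, c) for _, r, c in scored[:top_k]]

# Toy example: an 8x8 map where the bottom-left quadrant is "object-like".
attn = np.zeros((8, 8))
attn[4:8, 0:4] = 1.0
print(select_local_patches(attn, patch_size=4, top_k=1))  # → [(4, 0)]
```

The selected coordinates could then be used to crop the corresponding image regions, which are fed back to the model during training and map generation to sharpen local detail, in the spirit of the approach described above.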