Xin Deng, Songjian Chen, Yifan Chen, Jie-Fang Xu
Most existing crowd counting methods are supervised algorithms based purely on convolutional neural networks (CNNs). Although these methods have attained good results on some datasets, they share several common problems. Annotating data for supervised methods is expensive, and the shortage of labeled datasets limits the further development of supervised crowd counting algorithms. Meanwhile, pure CNN-based algorithms are limited in their ability to model connections among features. To overcome these problems, we propose a semi-supervised crowd counting algorithm that combines a CNN with a transformer. Specifically, our method consists of two parts: a Multi-Level Convolutional Transformer (MLCT) and an Adaptive Scale Module (ASM). MLCT is the counting branch, with a CNN as its front end and a transformer as its back end. ASM outputs an adaptive scale factor for the unlabeled crowd images. We generate a ranking list based on this factor; the list is fed into the MLCT, and a loss is computed according to the order of the list. Unlike most crowd counting methods, we use a region-level regression target for labeled images, which is a weaker regression target than location-level regression. Furthermore, we train the entire model using a novel loss function that combines L1 loss and ranking loss. Experimental results on three challenging datasets, ShanghaiTech Part A, ShanghaiTech Part B, and UCF-QNRF, demonstrate the effectiveness of the proposed approach.
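The combination of L1 loss and ranking loss described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the function names, the pairwise hinge form of the ranking term, the `margin` and `weight` parameters, and the convention that the ranking list is ordered so that predicted counts should be non-decreasing.

```python
# Hypothetical sketch of an "L1 + ranking" objective; not the authors' code.

def l1_loss(pred, target):
    """Mean absolute error between predicted and ground-truth counts (labeled images)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def ranking_loss(ranked_pred, margin=0.0):
    """Pairwise hinge penalty over consecutive entries of a ranking list.

    Assumption: ranked_pred holds predicted counts for unlabeled images,
    ordered so that each count should be <= the next one. A pair is
    penalized only when that order is violated (by more than -margin).
    """
    pairs = list(zip(ranked_pred, ranked_pred[1:]))
    if not pairs:
        return 0.0
    return sum(max(0.0, lower - higher + margin) for lower, higher in pairs) / len(pairs)


def combined_loss(labeled_pred, labeled_target, ranked_pred, weight=1.0, margin=0.0):
    """L1 term on labeled data plus a weighted ranking term on unlabeled data."""
    return l1_loss(labeled_pred, labeled_target) + weight * ranking_loss(ranked_pred, margin)


if __name__ == "__main__":
    # Two labeled images and a ranking list of three unlabeled images.
    print(combined_loss([10.0, 20.0], [12.0, 17.0], [5.0, 3.0, 8.0]))
```

In this sketch the labeled images contribute only through the L1 term, while the unlabeled images contribute only through order violations, so no ground-truth counts are needed for them. How the two terms are actually weighted in the proposed method is not stated in the abstract.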
Zhuangzhuang Miao, Yong Zhang, Peng Yuan, Haocheng Peng, Baocai Yin
Mingwei Yao, Kehua Guo, Lingyan Zhang, Xuyang Tan, Xiaokang Zhou
Zhuangzhuang Miao, Yong Zhang, Hao Ren, Yongli Hu, Baocai Yin
Yongjie Wang, Wei Zhang, Dongxiao Huang, Yanyan Liu, Jianghua Zhu