Lianyi Bao, Songjian Chen, Fei Han
At present, crowd counting in complex backgrounds remains a significant challenge, yet a meaningful task for public safety. We focus on this problem and propose a multi-scale multi-clue crowd counting network (MMNet), which is composed of a feature-encoder backbone and, as the decoder, four stacked multi-clue crowd estimation modules (MCEM) operating at multiple scales. Each module consists of three predictors: a shared attention predictor (SAP), a density map predictor (DMP), and a local counting map predictor (LCMP). DMP utilizes the information of each pixel of the image, while LCMP divides the image into patches and counts the number of people in each patch. These two predictors address inaccurate crowd counting under complex backgrounds from the perspective of the training target: they exploit the microscopic and macroscopic information of the image for model training, respectively. SAP helps them concentrate on the human-head regions of the image by generating multi-scale shared attention maps, from the perspective of feature extraction. Furthermore, we design a multi-task joint training strategy that automatically adjusts the loss weights of the different tasks to promote training and improve the robustness of the model. Extensive experiments on three challenging datasets (ShanghaiTech, UCF_CC_50, UCF-QNRF) demonstrate the superior performance of MMNet.
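The LCMP training target described above can be derived from the same annotations as the density map: summing the density map over non-overlapping patches yields per-patch counts, and the map's total still equals the image-level crowd count. A minimal NumPy sketch of this relationship (the function name and the divisibility assumption are illustrative, not taken from the paper):

```python
import numpy as np

def local_counting_map(density, patch):
    """Sum a density map over non-overlapping patch x patch cells.

    Each output cell holds the (possibly fractional) head count inside
    that patch, so the local counting map's total equals the image-level
    count. Assumes height and width are divisible by `patch`; otherwise
    pad or crop the density map first.
    """
    h, w = density.shape
    assert h % patch == 0 and w % patch == 0, "pad or crop first"
    return density.reshape(h // patch, patch, w // patch, patch).sum(axis=(1, 3))

# Toy 4x4 density map with two "heads", each of total mass 1.
d = np.zeros((4, 4))
d[0, 0] = 1.0          # head concentrated in the top-left patch
d[2:4, 2:4] = 0.25     # head spread over the bottom-right patch
lcm = local_counting_map(d, 2)
print(lcm)             # [[1. 0.] [0. 1.]]
print(lcm.sum())       # 2.0 == total count
```

This is only the target-construction side; how MMNet's LCMP predicts such maps from features is defined by the network itself.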