Jiahao Bao, Menglong Yan, Yiran Yang, Kaiqiang Chen
Siamese network-based trackers have advanced rapidly in visual object tracking. Many current Siamese trackers rely on result fusion to combine the classification result map and the regression result map. However, these result maps are computed from multi-level feature maps and are independent of each other, so fusing them at the result level is poorly justified. In addition, the classification module and the regression module operate independently of each other, which leads to feature misalignment. In this paper, we propose a feature-fusion approach that fuses similarity response maps with a novel scale attention mechanism and then decodes the fused features. To reduce feature misalignment and produce more precise tracking results, we train the model with a Classification Supervised Regression Loss (CSRL). Experiments on three challenging benchmark datasets show that the method outperforms current models in both tracking performance and efficiency, running at 40 fps.
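The abstract does not specify how the scale attention mechanism is built, so the following is only a minimal sketch of the general idea of fusing per-level similarity response maps with attention weights before decoding. The function names `softmax` and `scale_attention_fuse`, and the choice of scoring each level by its global average response, are assumptions made for illustration; in a trained tracker the weights would presumably come from learned layers.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scale_attention_fuse(response_maps):
    """Fuse same-sized 2D response maps (one per feature level) into one map.

    Assumption: each level is scored by its global average response, and the
    scores are normalised across levels with a softmax. This is a hypothetical
    stand-in for the paper's scale attention, not its actual design.
    """
    # One scalar score per level: mean of all entries in that level's map.
    scores = [sum(map(sum, m)) / (len(m) * len(m[0])) for m in response_maps]
    weights = softmax(scores)
    h, w = len(response_maps[0]), len(response_maps[0][0])
    # Weighted sum across levels at every spatial position.
    fused = [
        [sum(wt * m[i][j] for wt, m in zip(weights, response_maps)) for j in range(w)]
        for i in range(h)
    ]
    return fused, weights

# Two toy 2x2 response maps standing in for two feature levels.
maps = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
fused, weights = scale_attention_fuse(maps)
```

The point of the sketch is the ordering: the levels are combined into a single fused map first, and a decoder would then produce classification and regression outputs from that shared map, rather than merging separately computed result maps afterwards.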
Da Li, Yabing Kang, Xing Xiang, Wensheng Tao, Jiwei Hu
Qiongrui Liu, Xiyi Wang, Wenjie Wu, Xilin Zhu
Zhixi Wu, Baichen Liu, Shunzhi Zhu
Da Li, Yuyang Luo, Song Jin, Wenqi Huang
Dongyan Guo, Jun Wang, Weixuan Zhao, Ying Cui, Zhenhua Wang, Shengyong Chen