Inspired by the Transformer, this paper proposes a new attention-based feature fusion network that combines template features and search-region features using attention alone. Specifically, the method comprises a contextual enhancement module based on multi-head self-attention and a cross-feature enhancement module based on cross-attention; the two resulting features are then combined through a residual structure to effectively enhance the fused representation. Experiments show that our tracker achieves strong results on the GOT-10k benchmark. It runs at approximately 45 FPS on a GPU, meeting the real-time requirement.
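The fusion scheme described above can be illustrated with a minimal sketch: self-attention first enhances each feature set in context, cross-attention then lets the search-region features attend to the template, and residual connections combine the results. This is a simplified single-head NumPy illustration without learned projections or normalization, not the paper's actual implementation; all function and variable names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (single head, no learned projections).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def fuse(template, search):
    # Contextual enhancement: self-attention within each feature set,
    # combined with the input via a residual connection.
    t = template + attention(template, template, template)
    s = search + attention(search, search, search)
    # Cross-feature enhancement: search-region tokens query the
    # template tokens; residual structure combines the two features.
    return s + attention(s, t, t)

# Hypothetical shapes: 64 template tokens and 256 search tokens, dim 32.
template = np.random.rand(64, 32)
search = np.random.rand(256, 32)
fused = fuse(template, search)
print(fused.shape)  # (256, 32)
```

In a full tracker, each attention call would be multi-head with learned query/key/value projections, and the fused features would feed a prediction head; this sketch only shows the flow of information between the two branches.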
Jia Hu, Xiaoping Fan, Shengzong Liu, Lirong Huang
Shuxian Wang, Haibo Ge, Wenhao Li, Li'Ang Liu, Ting Zhou, Shenghua Yang
Qingbo Ji, Kuicheng Chen, Changbo Hou, Ziqi Li, Yufei Qi
Xinping Pan, Zhen Wang, Xiaolin Shi, Junjie Li