CLIP-based multi-modal deep learning has been widely applied to a variety of tasks and achieves leading results. Traditional vision-language multi-modal tracking paradigms extract vision and language features separately and fuse them afterwards, which makes end-to-end training difficult. This paper proposes a framework that extracts and fuses vision and language features synchronously. Specifically, adaptive multi-modal attention is applied to the language-template and language-search-region pairs respectively. A vision module then performs self-attention and cross-attention between the template and the search region, integrating long- and short-term cues about the target to model its position in the search region. Finally, the model learns a unified feature representation through a modality alignment module; trained with contrastive learning, this design extracts and fuses vision-language features efficiently. Extensive experiments are conducted on three benchmarks: TNL2K, OTB99-Lang, and LaSOT. The method achieves promising results and outperforms state-of-the-art trackers on TNL2K and OTB99-Lang. Extensive ablation studies further demonstrate the effectiveness of each component.
Chunhui Zhang, Sun Xin, Yiqian Yang, Li Liu, Qiong Liu, Xi Zhou, Yanfeng Wang
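To make the described pipeline concrete, below is a minimal PyTorch sketch of the two ideas named in the abstract: joint (synchronous) extraction-and-fusion attention over language and vision tokens, and a contrastive modality-alignment objective. All module names, dimensions, and the exact attention layout are illustrative assumptions, not the authors' released code.

```python
# A minimal sketch, assuming a ViT-style token interface. Names, dimensions,
# and the attention layout are illustrative, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointVLBlock(nn.Module):
    """One block that fuses language tokens with vision tokens via
    multi-head self-attention over the concatenated sequence, so feature
    extraction and cross-modal fusion happen in a single pass."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, lang, vis):
        x = torch.cat([lang, vis], dim=1)  # joint language+vision sequence
        n = self.norm1(x)
        x = x + self.attn(n, n, n)[0]      # synchronous extraction + fusion
        x = x + self.mlp(self.norm2(x))
        return x[:, :lang.size(1)], x[:, lang.size(1):]


def alignment_loss(lang_emb, vis_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss pulling matched
    language/vision embeddings together within a batch (an assumption
    about the modality-alignment objective)."""
    l = F.normalize(lang_emb, dim=-1)
    v = F.normalize(vis_emb, dim=-1)
    logits = l @ v.t() / temperature
    targets = torch.arange(l.size(0), device=l.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


if __name__ == "__main__":
    B, Lt, Nz, Nx, D = 2, 8, 64, 256, 256
    block = JointVLBlock(D)
    lang = torch.randn(B, Lt, D)              # language description tokens
    template = torch.randn(B, Nz, D)          # template patch tokens
    search = torch.randn(B, Nx, D)            # search-region patch tokens
    lang_z, template = block(lang, template)  # language-template attention
    lang_x, search = block(lang, search)      # language-search attention
    # Cross-attention: search-region queries attend to the template,
    # injecting the target's appearance cues into the search features.
    cross = nn.MultiheadAttention(D, 8, batch_first=True)
    search, _ = cross(search, template, template)
    loss = alignment_loss(lang_z.mean(1), search.mean(1))
    print(loss.item())
```

Concatenating language and patch tokens into one sequence lets a single attention layer perform extraction and fusion in the same pass, which is what makes the framework straightforward to train end-to-end compared with separate-then-fuse pipelines.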