In recent years, applying Transformers to object tracking has yielded promising results, as they allow comprehensive global information to be extracted from image features. However, current state-of-the-art Transformer networks remain limited in their ability to capture relevant local information. To address this issue, this paper presents a novel framework, the parallel multi-scale feature fusion Transformer (PFOT). The main contributions of this study are as follows: 1) parallel multi-scale feature fusion combining Convolutional Neural Networks (CNNs) and Transformers, which mitigates the weakness of existing Transformer-based tracking networks in capturing local information; 2) a progressive fusion approach that reconciles the spatial disparities inherent in multi-scale features; 3) evidence that this fusion strategy plays a pivotal role in improving the network's overall performance. Extensive experiments on three challenging benchmarks, NFS, OTB100, and UAV123, demonstrate that PFOT achieves performance comparable to state-of-the-art tracking algorithms.
Ziyang Zhang, Chuqing Cao, Fangjun Zheng, Yufan Zhang, Xinlong Liu, Lei Deng, Jianxi Yang, Ying Lü, Huibing Wang, Zhe Chen, Zheng Zhang
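The progressive fusion idea described above — reconciling spatial disparities between multi-scale features step by step rather than merging all scales at once — can be illustrated with a minimal coarse-to-fine sketch. This is an illustrative assumption, not PFOT's actual fusion module: the function names (`upsample2x`, `progressive_fuse`), the nearest-neighbor upsampling, and the element-wise addition are all placeholders for whatever learned fusion blocks the network uses.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def progressive_fuse(features):
    """Fuse multi-scale features progressively, from coarse to fine.

    `features` is a list of (C, H, W) arrays ordered fine -> coarse,
    each level half the spatial size of the previous one. The coarsest
    map is upsampled and merged into the next finer level one step at a
    time, so each fusion only bridges a 2x spatial gap instead of
    combining widely disparate resolutions in a single step.
    (Hypothetical sketch; PFOT's real fusion blocks are learned.)
    """
    fused = features[-1]
    for finer in reversed(features[:-1]):
        fused = finer + upsample2x(fused)
    return fused

# Toy multi-scale features: 32x32, 16x16, and 8x8 maps with 4 channels.
feats = [np.ones((4, 32, 32)), np.ones((4, 16, 16)), np.ones((4, 8, 8))]
out = progressive_fuse(feats)
print(out.shape)  # (4, 32, 32)
```

The stepwise structure is the point: each merge operates on features whose resolutions differ by only one octave, which is the kind of gradual reconciliation of spatial disparity the abstract attributes to the progressive fusion approach.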