JOURNAL ARTICLE

Multi-Modal Object Tracking with Vision-Language Adaptive Fusion and Alignment

Abstract

CLIP-based multi-modal deep learning is now widely used across tasks and has achieved leading results. Traditional vision-language multi-modal tracking paradigms extract vision and language features separately and then fuse them, which makes end-to-end training difficult. This paper proposes a framework that extracts and fuses vision and language features synchronously. Specifically, adaptive multi-modal attention is applied to the language-template and language-search-region pairs respectively. A vision module then performs self-attention and cross-attention between the template and the search region, integrating long- and short-term cues about the target to model its position in the search region. Finally, the model learns a unified feature representation through a modality alignment module; via contrastive learning, the method can extract and fuse vision-language features efficiently. Extensive experiments are conducted on three benchmarks: TNL2K, OTB99-Lang and LaSOT. The method achieves promising results, outperforming state-of-the-art trackers on TNL2K and OTB99-Lang. Extensive ablation experiments further demonstrate the effectiveness of the method.
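The abstract does not spell out the contrastive objective used by the modality alignment module; a CLIP-style symmetric InfoNCE loss over matched vision/language feature pairs is the standard formulation for this kind of alignment. Below is a minimal, dependency-free sketch of that loss. The function names and the temperature value are illustrative assumptions, not the paper's implementation.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit length, as in CLIP-style alignment."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def clip_style_alignment_loss(vision_feats, lang_feats, temperature=0.07):
    """Symmetric InfoNCE loss over matched vision/language pairs.

    vision_feats[i] and lang_feats[i] are a positive pair; every other
    pairing in the batch serves as a negative. The temperature (0.07) is
    an assumed hyperparameter, following common practice.
    """
    v = [l2_normalize(f) for f in vision_feats]
    t = [l2_normalize(f) for f in lang_feats]
    n = len(v)
    # Pairwise cosine similarities scaled by temperature.
    logits = [[dot(v[i], t[j]) / temperature for j in range(n)] for i in range(n)]

    def cross_entropy(row, target):
        # Numerically stable log-sum-exp.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        return log_z - row[target]

    # Vision-to-language and language-to-vision directions, averaged.
    loss_v2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    loss_t2v = sum(cross_entropy([logits[j][i] for j in range(n)], i)
                   for i in range(n)) / n
    return 0.5 * (loss_v2t + loss_t2v)
```

With perfectly aligned pairs the loss approaches zero, while mismatched pairs drive it up, which is what pushes the two modalities toward a unified representation.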

Keywords:
Computer vision, Artificial intelligence, Multi-modal tracking, Object tracking, Sensor fusion, Video tracking

Metrics

Cited by: 1
FWCI (Field-Weighted Citation Impact): 0.18
References: 31
Citation normalized percentile: 0.48

Topics

Advanced Image and Video Retrieval Techniques
Video Surveillance and Tracking Methods
Robotic Path Planning Algorithms
(all classified under Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
© 2026 ScienceGate Book Chapters — All rights reserved.