CLIP-based multi-modal deep learning has been widely applied to a variety of tasks and achieves leading results. Traditional vision-language multi-modal tracking paradigms extract vision and language features separately and fuse them afterwards, which makes end-to-end training difficult. This paper proposes a framework that extracts and fuses vision and language features synchronously. Specifically, adaptive multi-modal attention is applied to the language-template and language-search-region pairs respectively. A vision module then performs self-attention and cross-attention between the template and the search region, integrating long- and short-term cues about the target to model its position in the search region. Finally, the model learns a unified feature representation through a modality alignment module; trained with contrastive learning, this design extracts and fuses vision-language features efficiently. Extensive experiments are conducted on three benchmarks: TNL2K, OTB99-Lang, and LaSOT. The method achieves promising results and outperforms state-of-the-art trackers on TNL2K and OTB99-Lang. Extensive ablation studies further demonstrate the effectiveness of each component.
Chunhui Zhang, Sun Xin, Yiqian Yang, Li Liu, Qiong Liu, Xi Zhou, Yanfeng Wang
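To make the described pipeline concrete, below is a minimal PyTorch sketch of the two ideas named in the abstract: joint (synchronous) extraction-and-fusion attention over language and vision tokens, and a contrastive modality-alignment objective. All module names, dimensions, and the exact attention layout are illustrative assumptions, not the authors' released code.

```python
# A minimal sketch, assuming a ViT-style token interface. Names, dimensions,
# and the attention layout are illustrative, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointVLBlock(nn.Module):
    """One block that fuses language tokens with vision tokens via
    multi-head self-attention over the concatenated sequence, so feature
    extraction and cross-modal fusion happen in a single pass."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, lang, vis):
        x = torch.cat([lang, vis], dim=1)  # joint language+vision sequence
        n = self.norm1(x)
        x = x + self.attn(n, n, n)[0]      # synchronous extraction + fusion
        x = x + self.mlp(self.norm2(x))
        return x[:, :lang.size(1)], x[:, lang.size(1):]


def alignment_loss(lang_emb, vis_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss pulling matched
    language/vision embeddings together within a batch (an assumption
    about the modality-alignment objective)."""
    l = F.normalize(lang_emb, dim=-1)
    v = F.normalize(vis_emb, dim=-1)
    logits = l @ v.t() / temperature
    targets = torch.arange(l.size(0), device=l.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


if __name__ == "__main__":
    B, Lt, Nz, Nx, D = 2, 8, 64, 256, 256
    block = JointVLBlock(D)
    lang = torch.randn(B, Lt, D)              # language description tokens
    template = torch.randn(B, Nz, D)          # template patch tokens
    search = torch.randn(B, Nx, D)            # search-region patch tokens
    lang_z, template = block(lang, template)  # language-template attention
    lang_x, search = block(lang, search)      # language-search attention
    # Cross-attention: search-region queries attend to the template,
    # injecting the target's appearance cues into the search features.
    cross = nn.MultiheadAttention(D, 8, batch_first=True)
    search, _ = cross(search, template, template)
    loss = alignment_loss(lang_z.mean(1), search.mean(1))
    print(loss.item())
```

Concatenating language and patch tokens into one sequence lets a single attention layer perform extraction and fusion in the same pass, which is what makes the framework straightforward to train end-to-end compared with separate-then-fuse pipelines.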