All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment

Chunhui Zhang; Sun Xin; Yiqian Yang; Li Liu; Qiong Liu; Xi Zhou; Yanfeng Wang

doi:10.1145/3581783.3611803


JOURNAL ARTICLE

Year: 2023 Pages: 5552-5561

Abstract

The current mainstream vision-language (VL) tracking framework consists of three parts, i.e., a visual feature extractor, a language feature extractor, and a fusion model. To pursue better performance, a natural modus operandi for VL tracking is to employ customized, heavier unimodal encoders and multi-modal fusion models. Albeit effective, existing VL trackers separate feature extraction from feature integration, so the extracted features lack semantic guidance and have limited target-aware capability in complex scenarios, e.g., similar distractors and extreme illumination. In this work, inspired by the recent success of foundation models with unified architectures for both natural language and computer vision tasks, we propose an All-in-One framework, which learns joint feature extraction and interaction by adopting a unified transformer backbone. Specifically, we mix raw vision and language signals to generate language-injected vision tokens, which we then concatenate before feeding into the unified backbone architecture. This approach achieves feature integration in a unified backbone, removing the need for carefully designed fusion modules and resulting in a more effective and efficient VL tracking framework. To further improve learning efficiency, we introduce a multi-modal alignment module based on cross-modal and intra-modal contrastive objectives, providing more reasonable representations for the unified All-in-One transformer backbone. Extensive experiments on five benchmarks, i.e., OTB99-L, TNL2K, LaSOT, LaSOTExt, and WebUAV-3M, demonstrate the superiority of the proposed tracker over existing state-of-the-art (SOTA) methods on VL tracking. Code will be available at https://github.com/983632847/All-in-One.
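
The abstract does not give the exact form of the contrastive objectives, so the following is only a minimal sketch of what a cross-modal contrastive loss could look like, assuming a symmetric InfoNCE formulation over paired vision and language embeddings. The function names, the temperature value of 0.07, and the toy embeddings are all hypothetical and are not taken from the paper or its code release.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss where queries[i] and keys[i] form the positive pair.

    Every other key in the batch acts as a negative for queries[i].
    The temperature is an assumed value, not one reported by the authors.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity between query i and every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(q)


def cross_modal_loss(vision, language, temperature=0.07):
    """Symmetric vision-to-language and language-to-vision contrastive loss."""
    return 0.5 * (info_nce(vision, language, temperature)
                  + info_nce(language, vision, temperature))


# Hypothetical toy embeddings for two (video, description) pairs.
vision = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each description matches its own video
swapped = [[0.1, 0.9], [0.9, 0.1]]   # descriptions paired with the wrong video

loss_aligned = cross_modal_loss(vision, aligned)
loss_swapped = cross_modal_loss(vision, swapped)
print(loss_aligned < loss_swapped)
```

Under the same assumption, an intra-modal term could reuse `info_nce` with two views of the same modality as queries and keys; how the authors actually combine or weight the two objectives is not stated in the abstract.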

Keywords:

Computer science; Artificial intelligence; Feature extraction; BitTorrent tracker; Feature (linguistics); Transformer; Modal; Natural language; Computer vision; Eye tracking; Natural language processing; Engineering

Metrics

FWCI (Field Weighted Citation Impact): 4.00

Citation Normalized Percentile: 0.93

Topics

Video Surveillance and Tracking Methods

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Human Pose and Action Recognition

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Multi-Modal Object Tracking with Vision-Language Adaptive Fusion and Alignment

Textual Tokens Classification for Multi-Modal Alignment in Vision-Language Tracking

Multi-Modal Hybrid Interaction Vision-Language Tracking

Dual-stream Multi-modal Interactive Vision-language Tracking

UMPA: Unified multi-modal prompt with adapter for vision-language models