OpenVIS: Open-vocabulary Video Instance Segmentation

Pinxue Guo; Hao Huang; Peiyang He; Xuefeng Liu; Tianjun Xiao; Wenqiang Zhang

doi:10.1609/aaai.v39i3.32338

JOURNAL ARTICLE

Year: 2025
Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Vol: 39 (3)
Pages: 3275-3283
Publisher: Association for the Advancement of Artificial Intelligence

Abstract

Open-vocabulary Video Instance Segmentation (OpenVIS) simultaneously detects, segments, and tracks arbitrary object categories in a video, without being constrained to the categories seen during training. In this work, we propose InstFormer, a carefully designed framework for the OpenVIS task that achieves powerful open-vocabulary capabilities through lightweight fine-tuning on limited-category data. InstFormer begins with an open-world mask proposal network, which a contrastive instance margin loss encourages to propose class-agnostic masks for all potential instances. Next, we introduce InstCLIP, adapted from pre-trained CLIP with Instance Guidance Attention, which encodes open-vocabulary instance tokens efficiently. These instance tokens not only enable open-vocabulary classification but also offer strong universal tracking capabilities. Furthermore, to prevent the tracking module from being constrained by training data with limited categories, we propose the universal rollout association, which transforms the tracking problem into predicting the next frame’s instance tracking token. Experimental results demonstrate that the proposed InstFormer achieves state-of-the-art capabilities on a comprehensive OpenVIS evaluation benchmark, while also achieving competitive performance on the fully supervised VIS task.
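The open-vocabulary classification step mentioned in the abstract can be illustrated with a generic CLIP-style sketch: an instance token is compared against text embeddings of candidate category names, and the most similar name is chosen. This is not the authors' implementation. The function names (`cosine`, `classify_instance`) and the three-dimensional toy vectors are hypothetical and stand in for real InstCLIP instance tokens and CLIP text embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_instance(instance_token, text_embeddings):
    """Return the category name whose text embedding is most similar to the token.

    text_embeddings maps a category name to its embedding vector. The vocabulary
    can be changed at inference time without retraining, which is the sense in
    which this kind of classifier is open-vocabulary.
    """
    return max(
        text_embeddings,
        key=lambda name: cosine(instance_token, text_embeddings[name]),
    )


# Toy embeddings (assumed values, for illustration only).
text_embeddings = {
    "dog": [1.0, 0.0, 0.0],
    "skateboard": [0.0, 1.0, 0.0],
    "zebra": [0.0, 0.0, 1.0],
}
instance_token = [0.1, 0.9, 0.2]

label = classify_instance(instance_token, text_embeddings)
print(label)
```

Adding a new category in this sketch only requires adding another entry to `text_embeddings`; the instance token itself is unchanged.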

Keywords: Computer science; Segmentation; Vocabulary; Artificial intelligence; Computer vision; Natural language processing; Linguistics

Metrics

FWCI (Field Weighted Citation Impact): 3.67

Citation Normalized Percentile: 0.83

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Video Analysis and Summarization

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Towards Open-Vocabulary Video Instance Segmentation

OV-VIS: Open-Vocabulary Video Instance Segmentation

Towards Real-Time Open-Vocabulary Video Instance Segmentation

Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation

Aligning Instance Brownian Bridge with Texts for Open-Vocabulary Video Instance Segmentation