The referring video object segmentation (R-VOS) task requires a model to understand both a referring expression and a video input. Most recent works are based on an encoder-decoder architecture: although the text and visual encoders benefit from separately pre-trained backbones, the decoder is trained from scratch on a combination of image/video segmentation datasets. However, pixel-wise annotation with referring expressions is extremely expensive, which makes it challenging to further improve performance. For the same reason, current vision-language pre-training works mainly focus on learning general feature representations for image-level or object-level tasks, which may not be optimal for the downstream pixel-level segmentation task. To bridge this gap, we present a general self-supervised language-video pre-training (SLVP) architecture. Using relatively cheap video-caption datasets, SLVP learns pixel-level features by introducing optical flow as an intermediate target during pre-training. Correspondingly, we propose simple transfer-learning models that reuse the pre-trained modules for the downstream R-VOS task. Furthermore, the proposed SLVP architecture supports either 'language as query' or 'vision as query' fusion. Experiments show the superiority of the under-studied 'vision as query' fusion, which outperforms state-of-the-art methods on the Ref-DAVIS17 and Ref-YouTube-VOS benchmarks even with fewer model parameters. We further adapt the challenging VISOR benchmark to the R-VOS task, where our SLVP serves as the first strong baseline.
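To make the distinction between the two fusion directions concrete, the following is a minimal sketch, not the paper's implementation: a single cross-attention block in which either the visual tokens or the language tokens act as queries. All module and argument names here (e.g., CrossModalFusion, vision_as_query) are illustrative assumptions rather than names from SLVP.

```python
# Minimal sketch of the two fusion directions described in the abstract.
# 'vision as query': visual tokens attend to referring-expression tokens.
# 'language as query': the attention direction is reversed.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, vision_as_query: bool = True):
        super().__init__()
        self.vision_as_query = vision_as_query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, N_v, C) flattened spatio-temporal visual features
        # txt_tokens: (B, N_t, C) referring-expression token features
        if self.vision_as_query:
            query, kv = vis_tokens, txt_tokens   # 'vision as query' fusion
        else:
            query, kv = txt_tokens, vis_tokens   # 'language as query' fusion
        fused, _ = self.attn(query, kv, kv)
        return self.norm(query + fused)          # residual connection + layer norm


if __name__ == "__main__":
    fusion = CrossModalFusion(vision_as_query=True)
    vis = torch.randn(2, 16 * 16, 256)   # e.g. 16x16 patch tokens from one frame
    txt = torch.randn(2, 10, 256)        # e.g. a 10-token referring expression
    print(fusion(vis, txt).shape)        # torch.Size([2, 256, 256])
```

In the 'vision as query' setting the output keeps the spatial resolution of the visual tokens, which is what a pixel-level segmentation decoder consumes; in the 'language as query' setting the output instead summarizes the video per text token.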