JOURNAL ARTICLE

SLVP: Self-Supervised Language-Video Pre-Training for Referring Video Object Segmentation

Abstract

The referring video object segmentation (R-VOS) task requires a model to understand both a referring expression and the video input. Most recent works are based on an encoder-decoder architecture. Although their text and visual encoders can benefit from separately pre-trained backbones, their decoders are trained from scratch on a combination of image and video segmentation datasets. However, pixel-wise annotation with referring expressions is extremely expensive, which makes it challenging to improve performance further. For the same reason, current vision-language pre-training works mainly focus on learning general feature representations for image-level or object-level tasks, which may not be optimal for the downstream pixel-level segmentation task. To bridge this gap, we present a general self-supervised language-video pre-training (SLVP) architecture. Using relatively cheap video caption datasets, SLVP can learn pixel-level features by introducing optical flow as the intermediate target during pre-training. Correspondingly, we propose simple transfer learning models that can reuse the pre-trained modules for the downstream R-VOS task. Furthermore, the proposed general SLVP architecture can support either 'language as query' fusion or 'vision as query' fusion. Experiments show the superiority of the under-studied 'vision as query' method, which achieves better performance than state-of-the-art methods on the Ref-DAVIS17 and Ref-YouTube-VOS benchmarks, even with fewer model parameters. We further adapt the challenging VISOR benchmark to the R-VOS task, and our SLVP serves as the first strong baseline for the R-VOS task on it.
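
The abstract does not specify how the two fusion modes are implemented. As a rough illustration only, the sketch below assumes that 'language as query' and 'vision as query' differ in which modality supplies the queries of a cross-attention step. The function `cross_attention`, the toy feature values, and the absence of learned projections are all assumptions made for this example and are not taken from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head dot-product cross-attention without learned projections.

    Hypothetical helper: each query attends over all key/value tokens and
    returns a weighted average of them, so the output has one row per query.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * kv[i] for w, kv in zip(weights, keys_values))
             for i in range(dim)]
        )
    return outputs


# Toy features (assumed): 2 word tokens and 6 visual tokens, dimension 4.
text_tokens = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]]
visual_tokens = [[0.1 * (i + j) for j in range(4)] for i in range(6)]

# 'Language as query': words attend over the visual tokens.
language_as_query = cross_attention(text_tokens, visual_tokens)

# 'Vision as query': visual tokens attend over the words.
vision_as_query = cross_attention(visual_tokens, text_tokens)

print(len(language_as_query), len(vision_as_query))
```

In this sketch the 'vision as query' output keeps one fused feature per visual token, while the 'language as query' output has one feature per word. Whether SLVP relies on this property is not stated in the abstract.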

Keywords:
Computer science; Artificial intelligence; Segmentation; Object (grammar); Computer vision; Training (meteorology); Natural language processing; Speech recognition; Geography

Metrics

Cited By: 5
FWCI (Field Weighted Citation Impact): 2.65
Refs: 56
Citation Normalized Percentile: 0.82

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence
