The referring video object segmentation (R-VOS) task requires a model to understand both a referring expression and a video input. Most recent works are based on an encoder-decoder architecture: although the text and visual encoders benefit from separately pre-trained backbones, the decoder is trained from scratch on a combination of image/video segmentation datasets. However, pixel-wise annotation with referring expressions is extremely expensive, which makes it challenging to further improve performance. For the same reason, current vision-language pre-training works mainly focus on learning general feature representations for image-level or object-level tasks, which may not be optimal for the downstream pixel-level segmentation task. To bridge this gap, we present a general self-supervised language-video pre-training (SLVP) architecture. Using relatively cheap video-caption datasets, SLVP learns pixel-level features by introducing optical flow as an intermediate target during pre-training. Correspondingly, we propose simple transfer-learning models that reuse the pre-trained modules for the downstream R-VOS task. Furthermore, the proposed SLVP architecture supports either 'language as query' or 'vision as query' fusion. Experiments show the superiority of the under-studied 'vision as query' fusion, which outperforms state-of-the-art methods on the Ref-DAVIS17 and Ref-YouTube-VOS benchmarks even with fewer model parameters. We further adapt the challenging VISOR benchmark to the R-VOS task, where our SLVP serves as the first strong baseline.
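To make the distinction between the two fusion directions concrete, the following is a minimal sketch, not the paper's implementation: a single cross-attention block in which either the visual tokens or the language tokens act as queries. All module and argument names here (e.g., CrossModalFusion, vision_as_query) are illustrative assumptions rather than names from SLVP.

```python
# Minimal sketch of the two fusion directions described in the abstract.
# 'vision as query': visual tokens attend to referring-expression tokens.
# 'language as query': the attention direction is reversed.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, vision_as_query: bool = True):
        super().__init__()
        self.vision_as_query = vision_as_query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, N_v, C) flattened spatio-temporal visual features
        # txt_tokens: (B, N_t, C) referring-expression token features
        if self.vision_as_query:
            query, kv = vis_tokens, txt_tokens   # 'vision as query' fusion
        else:
            query, kv = txt_tokens, vis_tokens   # 'language as query' fusion
        fused, _ = self.attn(query, kv, kv)
        return self.norm(query + fused)          # residual connection + layer norm


if __name__ == "__main__":
    fusion = CrossModalFusion(vision_as_query=True)
    vis = torch.randn(2, 16 * 16, 256)   # e.g. 16x16 patch tokens from one frame
    txt = torch.randn(2, 10, 256)        # e.g. a 10-token referring expression
    print(fusion(vis, txt).shape)        # torch.Size([2, 256, 256])
```

In the 'vision as query' setting the output keeps the spatial resolution of the visual tokens, which is what a pixel-level segmentation decoder consumes; in the 'language as query' setting the output instead summarizes the video per text token.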