Xing Wang, Zhe Xu, Yuanshi Zheng, Handing Wang
Abstract: Referring video object segmentation (RVOS) aims to segment the object corresponding to a language expression in a video. Most existing RVOS methods are trained with accurate per-pixel annotations, which are expensive and time-consuming to obtain. Moreover, they must update all parameters of a segmentation model, making training inefficient as the model scale increases. In this paper, we propose a novel parameter-efficient framework under weak supervision, dubbed ReferringAdapter, to address both issues. Specifically, we adapt an off-the-shelf image segmentation model for RVOS by plugging a small set of trained parameters, i.e., an adapter, into its intermediate layers. This efficiently endows a uni-modal image segmentation model with the cross-modal ability to segment the video object referred to by a language expression. To update the adapter parameters under weak supervision, instead of directly fusing the video and sentence-level language features, we propose chain-of-thought reasoning that considers the intermediate steps along the thought process. Extensive experiments demonstrate that training the adapter with 1.1% of the total parameters outperforms previous weakly supervised methods by 11.6–15.3 mAP and achieves performance comparable with fully supervised ones.
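The adapter concept the abstract relies on can be illustrated with a minimal bottleneck-adapter sketch: a small down-projection, a nonlinearity, an up-projection, and a residual connection, so only the tiny projection matrices are trained while the backbone stays frozen. This is a generic, hypothetical illustration of the adapter technique (all names and sizes are invented here), not ReferringAdapter's actual architecture.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def bottleneck_adapter(x, W_down, W_up):
    """Generic bottleneck adapter applied to one frozen-layer feature vector:
    down-project to a small bottleneck, apply ReLU, up-project back, and add
    a residual so the backbone's original features pass through unchanged
    when the adapter weights are near zero. Illustrative sketch only."""
    h = [max(v, 0.0) for v in matvec(W_down, x)]   # ReLU in the bottleneck
    up = matvec(W_up, h)                           # back to feature dim d
    return [xi + ui for xi, ui in zip(x, up)]      # residual connection

# Hypothetical sizes: backbone feature dim d=4, bottleneck r=2.
# Only W_down and W_up (r*d + d*r parameters) would be trained.
x = [1.0, -2.0, 0.5, 3.0]
W_down = [[0.1, 0.0, 0.0, 0.0],
          [0.0, 0.1, 0.0, 0.0]]
W_up = [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0], [0.0, 0.0]]
y = bottleneck_adapter(x, W_down, W_up)
assert len(y) == len(x)   # the adapter preserves the feature dimension
```

Because the residual keeps the output dimension equal to the input dimension, such a module can be inserted between existing layers of a pretrained model without altering the rest of the network, which is what makes the parameter count of the trained portion so small.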