JOURNAL ARTICLE

Parameter-efficient weakly supervised referring video object segmentation via chain-of-thought reasoning

Xing Wang, Zhe Xu, Yuanshi Zheng, Handing Wang

Year: 2025 Journal: Complex & Intelligent Systems Vol: 11 (6) Publisher: Springer Science+Business Media

Abstract

Referring video object segmentation (RVOS) aims to segment the object in a video that a language expression refers to. Most existing RVOS methods are trained with accurate per-pixel annotations, which are expensive and time-consuming to obtain. Moreover, they update all parameters of the segmentation model, which makes training inefficient as the model scale increases. In this paper, we propose a novel parameter-efficient framework under weak supervision, dubbed ReferringAdapter, to ameliorate both issues. Specifically, we propose to adapt an off-the-shelf image segmentation model for RVOS by plugging a small set of trained parameters, i.e., an adapter, into the intermediate layer. This efficiently endows a uni-modal image segmentation model with the cross-modal ability to segment the video object referred to by a language expression. To update the adapter parameters under weak supervision, instead of directly fusing the video and sentence-level language features, we propose chain-of-thought reasoning to consider the intermediate steps along the thought process. Extensive experiments demonstrate that training only the adapter, which accounts for 1.1% of the total parameters, can outperform previous weakly supervised methods by 11.6 to 15.3 mAP and achieve performance comparable to fully supervised ones.
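
The parameter-efficiency idea can be illustrated with a small accounting sketch. It assumes a generic bottleneck adapter (down-projection, nonlinearity, up-projection, residual add) attached to a frozen backbone; the abstract does not specify the internal design of ReferringAdapter, so the class name, layer widths, adapter count, and backbone size below are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BottleneckAdapter:
    """Hypothetical adapter: down-project, nonlinearity, up-project, residual add.

    This is a generic adapter layout, not the paper's confirmed architecture.
    """
    dim: int         # width of the frozen backbone layer (assumed)
    bottleneck: int  # adapter hidden width, assumed much smaller than dim

    def num_params(self) -> int:
        down = self.dim * self.bottleneck + self.bottleneck  # weights + bias
        up = self.bottleneck * self.dim + self.dim           # weights + bias
        return down + up


def trainable_fraction(backbone_params: int,
                       adapters: Sequence[BottleneckAdapter]) -> float:
    """Share of all parameters that is trained when the backbone stays frozen."""
    adapter_params = sum(a.num_params() for a in adapters)
    return adapter_params / (backbone_params + adapter_params)


if __name__ == "__main__":
    # Illustrative numbers only: a 100M-parameter frozen backbone
    # with eight adapters of width 1024 and bottleneck 64.
    adapters = [BottleneckAdapter(dim=1024, bottleneck=64) for _ in range(8)]
    frac = trainable_fraction(100_000_000, adapters)
    print(f"trainable share: {frac:.2%}")
```

With the backbone frozen, only the adapter weights receive gradient updates, so the trained share shrinks as the backbone grows while the adapter size stays fixed.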

Keywords:
Computational intelligence; Object (grammar); Segmentation; Artificial intelligence; Chain (unit); Computer science; Pattern recognition (psychology); Natural language processing; Computer vision; Physics

Metrics

Cited By: 1
FWCI (Field Weighted Citation Impact): 4.77
Refs: 66
Citation Normalized Percentile: 0.84

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Video Analysis and Summarization
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition