Xing Wang, Zhe Xu, Yuanshi Zheng, Handing Wang
Abstract: Referring video object segmentation (RVOS) aims to segment the object corresponding to a language expression in a video. Most existing RVOS methods are trained with accurate per-pixel annotations, which are expensive and time-consuming to obtain. Moreover, they must update all parameters of a segmentation model, making training inefficient as the model scale increases. In this paper, we propose a novel parameter-efficient framework under weak supervision, dubbed ReferringAdapter, to address both issues. Specifically, we adapt an off-the-shelf image segmentation model for RVOS by plugging a small set of trained parameters, i.e., an adapter, into its intermediate layers. This efficiently endows a uni-modal image segmentation model with the cross-modal ability to segment the video object referred to by a language expression. To update the adapter parameters under weak supervision, instead of directly fusing the video and sentence-level language features, we propose chain-of-thought reasoning that considers the intermediate steps along the thought process. Extensive experiments demonstrate that training the adapter with 1.1% of the total parameters outperforms previous weakly supervised methods by 11.6–15.3 mAP and achieves performance comparable with fully supervised ones.
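The adapter concept the abstract relies on can be illustrated with a minimal bottleneck-adapter sketch: a small down-projection, a nonlinearity, an up-projection, and a residual connection, so only the tiny projection matrices are trained while the backbone stays frozen. This is a generic, hypothetical illustration of the adapter technique (all names and sizes are invented here), not ReferringAdapter's actual architecture.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def bottleneck_adapter(x, W_down, W_up):
    """Generic bottleneck adapter applied to one frozen-layer feature vector:
    down-project to a small bottleneck, apply ReLU, up-project back, and add
    a residual so the backbone's original features pass through unchanged
    when the adapter weights are near zero. Illustrative sketch only."""
    h = [max(v, 0.0) for v in matvec(W_down, x)]   # ReLU in the bottleneck
    up = matvec(W_up, h)                           # back to feature dim d
    return [xi + ui for xi, ui in zip(x, up)]      # residual connection

# Hypothetical sizes: backbone feature dim d=4, bottleneck r=2.
# Only W_down and W_up (r*d + d*r parameters) would be trained.
x = [1.0, -2.0, 0.5, 3.0]
W_down = [[0.1, 0.0, 0.0, 0.0],
          [0.0, 0.1, 0.0, 0.0]]
W_up = [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0], [0.0, 0.0]]
y = bottleneck_adapter(x, W_down, W_up)
assert len(y) == len(x)   # the adapter preserves the feature dimension
```

Because the residual keeps the output dimension equal to the input dimension, such a module can be inserted between existing layers of a pretrained model without altering the rest of the network, which is what makes the parameter count of the trained portion so small.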