JOURNAL ARTICLE

Spectrum-guided Multi-granularity Referring Video Object Segmentation

Abstract

Current referring video object segmentation (R-VOS) techniques extract conditional kernels from encoded (low-resolution) vision-language features to segment the decoded high-resolution features. We discovered that this causes significant feature drift, which the segmentation kernels struggle to perceive during the forward computation and which degrades their segmentation ability. To address the drift problem, we propose a Spectrum-guided Multi-granularity (SgMg) approach, which performs direct segmentation on the encoded features and employs visual details to further optimize the masks. In addition, we propose Spectrum-guided Cross-modal Fusion (SCF) to perform intra-frame global interactions in the spectral domain for effective multimodal representation. Finally, we extend SgMg to perform multi-object R-VOS, a new paradigm that enables simultaneous segmentation of multiple referred objects in a video. This makes R-VOS not only faster but also more practical. Extensive experiments show that SgMg achieves state-of-the-art performance on four video benchmark datasets, outperforming the nearest competitor by 2.8 percentage points on Ref-YouTube-VOS. Our extended SgMg enables multi-object R-VOS and runs about 3× faster while maintaining satisfactory performance. Code is available at https://github.com/bo-miao/SgMg.
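The abstract does not specify how SCF is implemented, so the following is only a minimal sketch of the general idea behind "global interactions in the spectral domain", not the authors' method. It relies on the convolution theorem: multiplying two signals element-wise in the frequency domain is equivalent to a circular convolution in the spatial domain, so a dense kernel can make every output position depend on every input position. The 1-D toy feature row, the filter values, and the function names (dft, spectral_mix, circular_conv) are all hypothetical and chosen purely for illustration.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform of a 1-D sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(
            x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n)
        )
        # The inverse transform carries the 1/n normalisation.
        out.append(acc / n if inverse else acc)
    return out


def spectral_mix(features, kernel):
    """Multiply two sequences element-wise in the frequency domain."""
    f_hat = dft(features)
    k_hat = dft(kernel)
    mixed = dft([a * b for a, b in zip(f_hat, k_hat)], inverse=True)
    # Inputs are real, so the imaginary parts are numerical noise only.
    return [v.real for v in mixed]


def circular_conv(features, kernel):
    """Reference: direct circular convolution in the spatial domain."""
    n = len(features)
    return [
        sum(features[j] * kernel[(i - j) % n] for j in range(n))
        for i in range(n)
    ]


# Toy 1-D "feature row" and a hypothetical filter (illustrative values only).
features = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 4.0]
kernel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]

spectral = spectral_mix(features, kernel)
direct = circular_conv(features, kernel)
max_diff = max(abs(a - b) for a, b in zip(spectral, direct))
```

Here max_diff is expected to be at floating-point noise level, which confirms that the spectral route and the spatial route compute the same mixing. In practice a fast Fourier transform over 2-D feature maps would replace the naive transform used above.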

Keywords:
Granularity, Computer science, Segmentation, Computer vision, Artificial intelligence, Object (grammar), Image segmentation

Metrics

Cited By: 39
FWCI (Field Weighted Citation Impact): 7.10
Refs: 86
Citation Normalized Percentile: 0.97

Topics

Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Video Surveillance and Tracking Methods
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
