JOURNAL ARTICLE

OCVOS: Object-Centric Representation for Video Object Segmentation

Abstract

Semi-supervised video object segmentation (VOS) methods aim to segment target objects given pixel-level annotations in the first frame. Many methods employ Transformer-based attention modules to propagate the first-frame annotations to the most similar patches or pixels in subsequent frames. Although these methods have shown impressive results, they remain prone to errors in challenging scenes with multiple overlapping objects. To tackle this problem, we propose an object-centric VOS (OCVOS) method that exploits query-based Transformer decoder blocks. After aggregating target-object information with a typical matching-based approach, the Transformer decoder extracts object-wise information by interacting with object queries. In this way, the proposed method considers not only global and contextual information but also object-centric representations. We validate its effectiveness at inducing object-wise information, comparing with existing methods on the DAVIS and YouTube-VOS benchmarks.
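The core idea of a query-based decoder, as described above, is that a small set of learnable object queries cross-attends to the frame's pixel or patch features to pool an object-wise representation. The following is a minimal single-head sketch of that cross-attention step; the dimensions, query count, and function names are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, features):
    # queries:  (num_objects, d) -- one learnable query per candidate object
    # features: (num_pixels, d)  -- flattened frame features from the
    #                               matching-based aggregation stage
    scores = queries @ features.T / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)   # each query attends over all pixels
    return weights @ features            # object-wise pooled representations

rng = np.random.default_rng(0)
d = 64
obj_queries = rng.standard_normal((3, d))      # 3 hypothetical object queries
pix_feats = rng.standard_normal((16 * 16, d))  # e.g. a 16x16 feature map
obj_repr = cross_attention(obj_queries, pix_feats)
print(obj_repr.shape)  # (3, 64)
```

In a full Transformer decoder block this cross-attention would be interleaved with self-attention among the queries and feed-forward layers, and stacked over several layers; the sketch isolates only the query-to-feature interaction that yields the object-centric embeddings.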

Keywords:
Computer science, Artificial intelligence, Computer vision, Segmentation, Object, Transformer, Pixel, Representation, Frame
