Semi-supervised video object segmentation (VOS) methods aim to segment target objects given pixel-level annotations in the first frame. Many methods employ Transformer-based attention modules to propagate the first-frame annotations to the most similar patches or pixels in subsequent frames. Although they have shown impressive results, they remain prone to errors in challenging scenes with multiple overlapping objects. To tackle this problem, we propose an object-centric VOS (OCVOS) method that exploits query-based Transformer decoder blocks. After aggregating target object information with typical matching-based approaches, the Transformer decoder blocks extract object-wise information by interacting with object queries. In this way, the proposed method considers not only global and contextual information but also object-centric representations. We validate its effectiveness in inducing object-wise information, in comparison with existing methods, on the DAVIS and YouTube-VOS benchmarks.
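The core idea of a query-based decoder block, as described above, is that a small set of learnable object queries cross-attends to the pixel-level features produced by the matching stage, so each query accumulates an object-centric representation that can then be compared against the feature map to produce per-object masks. The following is a minimal NumPy sketch of that mechanism only; the function names, dimensions, and the single-head attention are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def query_decoder_block(queries, features):
    """One simplified decoder step: object queries (N_obj, d)
    cross-attend to flattened pixel features (HW, d) and are
    updated with a residual connection."""
    d = queries.shape[-1]
    attn = softmax(queries @ features.T / np.sqrt(d), axis=-1)  # (N_obj, HW)
    return queries + attn @ features                            # (N_obj, d)

# toy example: 2 object queries over a 4x4 feature map, feature dim 8
rng = np.random.default_rng(0)
queries = rng.standard_normal((2, 8))
features = rng.standard_normal((16, 8))
updated = query_decoder_block(queries, features)

# object-wise masks: per-pixel softmax over query-feature similarities,
# so each pixel's scores across the object queries sum to 1
masks = softmax(updated @ features.T / np.sqrt(8), axis=0)  # (2, 16)
```

In a full model this block would be stacked several times with multi-head attention, feed-forward layers, and normalization; the sketch keeps only the query-to-feature interaction that gives the method its object-centric character.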
Yi Zhou, Hui Zhang, Hana Lee, Shuyang Sun, Pingjun Li, Yangguang Zhu, ByungIn Yoo, Xiaojuan Qi, Jae-Joon Han