JOURNAL ARTICLE

Object-Centric Representation Learning for Video Scene Understanding

Yi Zhou, Hui Zhang, Seung-In Park, ByungIn Yoo, Xiaojuan Qi

Year: 2024   Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence   Vol: 46 (12)   Pages: 8410-8423   Publisher: IEEE Computer Society

Abstract

Depth-aware Video Panoptic Segmentation (DVPS) is a challenging task that requires predicting the semantic class and 3D depth of each pixel in a video, while also segmenting objects and tracking them consistently across frames. Predominant methods treat this as a multi-task learning problem and tackle each constituent task independently, which limits their ability to exploit relationships among the tasks and requires separate parameter tuning for each one. To overcome these constraints, we present Slot-IVPS, a new approach that uses an object-centric model to learn unified object representations, allowing the model to capture semantic and depth information simultaneously. Specifically, we introduce a novel representation, Integrated Panoptic Slots (IPS), which captures both semantic and depth information for all panoptic objects within a video, covering background semantics as well as foreground instances. We then propose an integrated feature generator and enhancer to extract depth-aware features, alongside the Integrated Video Panoptic Retriever (IVPR), which iteratively retrieves spatio-temporally coherent object features and encodes them into IPS. The resulting IPS can be decoded directly into an array of video outputs, including depth maps, classifications, masks, and object instance IDs. We conduct comprehensive analyses across four datasets and attain state-of-the-art performance on both the Depth-aware Video Panoptic Segmentation and Video Panoptic Segmentation tasks.
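The abstract does not spell out how IVPR retrieves object features or how IPS is decoded, so the following is only a minimal sketch of the general idea, assuming a generic slot-attention-style update in place of the authors' module. Every name here (`retrieve_slots`, `decode`, `class_proj`, `depth_proj`) is hypothetical, the learned projections and recurrent update of a real model are omitted, and using one depth value per slot and the slot index as the instance ID are illustrative simplifications, not the paper's decoders.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def retrieve_slots(features, slots, num_iters=3, eps=1e-8):
    """Iteratively pull features into slots (generic sketch, not the paper's IVPR).

    features: (N, D) array, all pixels of all frames flattened into N locations.
    slots:    (K, D) array, one row per candidate panoptic object.
    """
    d = features.shape[1]
    for _ in range(num_iters):
        logits = slots @ features.T / np.sqrt(d)      # (K, N) slot-to-location affinity
        attn = softmax(logits, axis=0)                # slots compete for each location
        weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
        slots = weights @ features                    # weighted mean of claimed features
    return slots, attn

def decode(slots, features, class_proj, depth_proj):
    """Read masks, classes, and a coarse depth map out of the same slots (assumed heads)."""
    mask_logits = slots @ features.T                  # (K, N) per-slot mask scores
    owner = mask_logits.argmax(axis=0)                # (N,) slot index, reused as instance ID
    classes = (slots @ class_proj).argmax(axis=1)     # (K,) class per slot
    slot_depth = slots @ depth_proj                   # (K,) one depth value per slot
    depth_map = slot_depth[owner]                     # (N,) depth painted through the masks
    return owner, classes, depth_map

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(2 * 3 * 4, 8))           # 2 frames of 3x4 pixels, 8 channels
    slots, attn = retrieve_slots(feats, rng.normal(size=(4, 8)))
    owner, classes, depth_map = decode(
        slots, feats, rng.normal(size=(8, 3)), rng.normal(size=(8,))
    )
    print(owner.reshape(2, 3, 4))                     # per-frame instance ID maps
```

Because the locations of every frame are flattened into one set and share the same slots, a given slot index refers to the same object in all frames, which is one simple way to picture how a single representation could yield consistent instance IDs alongside class and depth outputs.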

Keywords:
Artificial intelligence, Computer science, Computer vision, Representation (politics), Object (grammar), Video tracking

Metrics

Cited By: 1
FWCI (Field Weighted Citation Impact): 0.53
Refs: 62
Citation Normalized Percentile: 0.51

Topics

Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Neural object-centric scene representation and generation

Singh, Gautam

Journal: Rutgers University Community Repository (Rutgers University)   Year: 2025

JOURNAL ARTICLE

Slot-VPS: Object-centric Representation Learning for Video Panoptic Segmentation

Yi Zhou, Hui Zhang, Hana Lee, Shuyang Sun, Pingjun Li, Yangguang Zhu, ByungIn Yoo, Xiaojuan Qi, Jae‐Joon Han

Journal: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)   Year: 2022   Pages: 3083-3093

JOURNAL ARTICLE

Unsupervised Learning of Global Object-Centric Representations for Compositional Scene Understanding

Tonglin Chen, Yinxuan Huang, Jinghao Huang, Bin Li, Xiangyang Xue

Journal: IEEE Transactions on Visualization and Computer Graphics   Year: 2025   Vol: 31 (10)   Pages: 8385-8396