Although deep learning has achieved remarkable success, it still falls short in robustness, systematic generalization, interpretability, reasoning, and creating new knowledge from limited experience. Addressing these limitations requires learning representations that capture the underlying causal structure of the data. A key step in this direction is discovering hidden generative causal variables, such as objects and other scene factors. This dissertation develops architectures and algorithms to infer object-centric representations of visual scenes without human supervision or labels. Building on the idea of perception as inverse graphics, existing approaches rely on inverting renderers that are brittle, cumbersome, and limited to simple visual scenes. In Part One, we propose, for the first time, the idea of inverting an expressive decoder to learn object-centric representations. We show that this achieves an unprecedented scene decomposition ability in visually complex scenes. It gracefully handles aspects of ray tracing, such as shadows and reflections, that are poorly handled by existing decoders. We also show evidence of systematic generalization by decoding novel object combinations. Next, to extend these benefits from images to videos, we explore two routes, one recurrent and one parallelizable, and analyze their trade-offs. In Part Two, we build on our previous success and move beyond monolithic object representations. We introduce a novel method that discovers not only objects but also intra-object factors, and does so, for the first time, in visually complex scenes.