Andra Petrovai, Sergiu Nedevschi
We propose a novel solution for the task of video panoptic segmentation that simultaneously predicts pixel-level semantic and instance segmentation and generates clip-level instance tracks. Our network, named VPS-Transformer, is a hybrid architecture built on the state-of-the-art panoptic segmentation network Panoptic-DeepLab: it combines a convolutional architecture for single-frame panoptic segmentation with a novel video module based on an instantiation of the pure Transformer block. The Transformer, equipped with attention mechanisms, models spatio-temporal relations between backbone output features of current and past frames for more accurate and consistent panoptic estimates. As the pure Transformer block introduces a large computational overhead when processing high-resolution images, we propose a few design changes for more efficient computation. We study how to aggregate information more effectively over the space-time volume and compare several variants of the Transformer block with different attention schemes. Extensive experiments on the Cityscapes-VPS dataset demonstrate that our best model improves the temporal consistency and video panoptic quality by a margin of 2.2%, with little extra computation.
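The abstract describes a video module that applies Transformer attention across backbone features of the current and a past frame. The sketch below illustrates one way such a space-time fusion block could look; it is a minimal assumption-based example, not the authors' actual VPS-Transformer module, and all names (`SpaceTimeFusion`, `d_model`, `n_heads`) are hypothetical.

```python
# Minimal sketch of a space-time attention block: queries come from the
# current-frame feature map, keys/values from the concatenated space-time
# volume of current and past frames. Hypothetical, for illustration only.
import torch
import torch.nn as nn


class SpaceTimeFusion(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(d_model)
        self.norm_kv = nn.LayerNorm(d_model)
        self.norm_ffn = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, feat_cur: torch.Tensor, feat_past: torch.Tensor) -> torch.Tensor:
        # feat_cur, feat_past: (B, C, H, W) backbone features of the current / past frame.
        b, c, h, w = feat_cur.shape
        # Flatten spatial dimensions into token sequences: (B, H*W, C).
        q = feat_cur.flatten(2).transpose(1, 2)
        # Space-time memory: tokens of both frames, (B, 2*H*W, C).
        kv = torch.cat([feat_cur.flatten(2), feat_past.flatten(2)], dim=2).transpose(1, 2)
        # Pre-norm attention and feed-forward with residual connections.
        attn_out, _ = self.attn(self.norm_q(q), self.norm_kv(kv), self.norm_kv(kv),
                                need_weights=False)
        x = q + attn_out
        x = x + self.ffn(self.norm_ffn(x))
        # Reshape back to a feature map for the downstream panoptic heads.
        return x.transpose(1, 2).reshape(b, c, h, w)


# Usage example (shapes only): fuse stride-16 backbone features of two frames.
cur = torch.randn(1, 256, 32, 64)
past = torch.randn(1, 256, 32, 64)
fused = SpaceTimeFusion()(cur, past)  # -> (1, 256, 32, 64)
```

Note that full attention over the `2*H*W` space-time tokens scales quadratically with resolution; the efficiency-oriented design changes and alternative attention schemes the abstract mentions (e.g., operating on downsampled or windowed features) would address exactly this cost, though their specifics are not given here.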