Qingfeng Liu, Mostafa El‐Khamy, Kee-Bong Song
Video Panoptic Segmentation (VPS) is the most challenging video segmentation task, as it requires accurately labeling every pixel in each frame while also identifying individual instances and tracking them across frames. In this paper, we explore state-of-the-art solutions for VPS in both the giant-model regime, for offline or server processing, and the tiny-model regime, for online or edge computing. We designed Giant-VPS, which won first place in the 2024 Pixel Level Video Understanding in the Wild (PVUW) challenge. Giant-VPS builds on MinVIS and deploys the DINOv2-giant vision foundation model with a carefully designed Vision Transformer (ViT) adapter. For mobile and edge devices, we designed the Tiny-VPS model and show that our novel ViT-adapter distillation from Giant-VPS further improves its accuracy. Tiny-VPS is the first model in the sub-20 GFLOPS regime to achieve competitive accuracy on VPS and Video Semantic Segmentation (VSS) benchmarks.
Yi Zhou, Hui Zhang, Hana Lee, Shuyang Sun, Pingjun Li, Yangguang Zhu, ByungIn Yoo, Xiaojuan Qi, Jae‐Joon Han
Andre Pawlowski, Victor van der Veen, Dennis Andriesse, Erik van der Kouwe, Thorsten Holz, Cristiano Giuffrida, Herbert Bos
Lothar Krisch, Wolfgang Schneider