We tackle the problem of semantic segmentation of dynamic scenes in video sequences. We propose to incorporate foreground object information into pixel labeling by jointly reasoning about the semantic labels of super-voxels, object instance tracks, and geometric relations between objects. We take an exemplar approach to object modeling, using a small set of object annotations and exploiting the temporal consistency of object motion. After generating a set of moving object hypotheses, we design a CRF framework that jointly models the super-voxels and object instances. The optimal semantic labeling is inferred by MAP estimation of the model, which is solved by a single move-making optimization procedure. We demonstrate the effectiveness of our method on three public datasets and show that our model achieves results superior or comparable to the state of the art with less object-level supervision.