Yingqi Gao, Zhiling Luo, Shiqian Chen, Wei Zhou
In this paper, we consider a novel task, Video Corpus Spatio-Temporal Grounding (VCSTG), for material selection and spatio-temporal adaptation in intelligent video editing. Given a text query describing an object and a corpus of untrimmed, unsegmented videos, VCSTG aims to localize a sequence of spatio-temporal object tubes from the video corpus. Existing methods tackle VCSTG in a multi-stage manner, encoding the query and video representations independently for each stage, which leads to local optima. We propose a novel one-stage multi-task learning framework named MTSTG for the VCSTG task. MTSTG learns unified query and video representations for the video retrieval, temporal grounding and spatial grounding tasks. Video-level, frame-level and object-level contrastive learning are introduced to measure the mutual information between the query and the video at different granularities. Comprehensive experiments demonstrate that our newly proposed framework outperforms state-of-the-art multi-stage methods on the VidSTG dataset.
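The multi-granularity contrastive objective can be illustrated with a minimal sketch. The PyTorch snippet below assumes an InfoNCE-style loss between L2-normalized query embeddings and video-, frame- and object-level embeddings (pooled to one vector per sample); the function names, loss weights and temperature are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb, candidate_emb, temperature=0.07):
    """InfoNCE loss: each query's positive is the candidate at the same
    batch index; all other candidates in the batch serve as negatives."""
    q = F.normalize(query_emb, dim=-1)      # (B, D)
    c = F.normalize(candidate_emb, dim=-1)  # (B, D)
    logits = q @ c.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

def multi_granularity_loss(q, video_emb, frame_emb, obj_emb,
                           weights=(1.0, 1.0, 1.0)):
    """Hypothetical sum of contrastive losses at video, frame and object
    granularity; frame/object embeddings are assumed pre-pooled."""
    return (weights[0] * info_nce(q, video_emb)
            + weights[1] * info_nce(q, frame_emb)
            + weights[2] * info_nce(q, obj_emb))

# Usage with random embeddings (batch of 8, dim 256):
B, D = 8, 256
q = torch.randn(B, D)
loss = multi_granularity_loss(q, torch.randn(B, D),
                              torch.randn(B, D), torch.randn(B, D))
```

Tying all three losses to a shared query embedding is what makes the representation unified across the retrieval, temporal and spatial tasks rather than being optimized per stage.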