Yaoyuan Liang, Xiao Liang, Yansong Tang, Zhao Yang, Ziran Li, Jingang Wang, Wenbo Ding, Shao-Lun Huang
This paper studies the spatio-temporal video grounding task, which aims to localize a spatio-temporal tube in an untrimmed video based on the given text description of an event. Existing one-stage approaches suffer from insufficient space-time interaction in two aspects: i) less precise prediction of event temporal boundaries, and ii) inconsistency in object prediction for the same event across adjacent frames. To address these issues, we propose a framework of Comprehensive Space-Time entAnglement (CoSTA) to densely entangle space-time multi-modal features for spatio-temporal localization. Specifically, we propose a space-time collaborative encoder to extract comprehensive video features and leverage Transformer to perform spatio-temporal multi-modal understanding. Our entangled decoder couples temporal boundary prediction and spatial localization via an entangled query, boasting an enhanced ability to capture object-event relationships. We conduct extensive experiments on the challenging benchmarks of HC-STVG and VidSTG, where CoSTA outperforms existing state-of-the-art methods, demonstrating its effectiveness for this task.
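The core idea of the entangled decoder — a single query that couples temporal boundary prediction with per-frame spatial localization — can be illustrated with a minimal sketch. This is a hypothetical toy example, not the authors' implementation: all shapes, weights, and head designs (`W_t`, `W_s`, the single-query attention) are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 8, 16  # frames, feature dim (hypothetical sizes)
# Stand-in for fused per-frame video-text features from the encoder.
video_text_feats = rng.standard_normal((T, D))

# A single "entangled" query shared by both prediction heads (illustrative).
query = rng.standard_normal(D)

# Cross-attention: the query aggregates information across all frames.
scores = video_text_feats @ query / np.sqrt(D)
attn = np.exp(scores - scores.max())
attn /= attn.sum()
entangled = attn @ video_text_feats  # (D,) event-level representation

# Temporal head: normalized (start, end) boundaries of the event.
W_t = rng.standard_normal((D, 2))  # hypothetical learned weights
start, end = 1.0 / (1.0 + np.exp(-(entangled @ W_t)))  # sigmoid to [0, 1]

# Spatial head: a per-frame box (cx, cy, w, h) conditioned on the SAME
# entangled query, so spatial and temporal predictions share one
# representation rather than being decoded independently.
W_s = rng.standard_normal((D, 4))  # hypothetical learned weights
boxes = 1.0 / (1.0 + np.exp(-((video_text_feats + entangled) @ W_s)))

print(boxes.shape)  # (8, 4): one normalized box per frame
```

The point of the sketch is only the coupling: because both heads read the same query, the predicted tube's boxes and its temporal extent are derived from one shared event representation, which is the property the paper argues improves cross-frame consistency.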