Wenfei Yang, Tianzhu Zhang, Yongdong Zhang, Feng Wu
Weakly supervised temporal sentence grounding offers better scalability and practicality than fully supervised methods in real-world application scenarios. However, most existing methods cannot model fine-grained video-text local correspondences well and lack effective supervision signals for correspondence learning, yielding unsatisfactory performance. To address these issues, we propose an end-to-end Local Correspondence Network (LCNet) for weakly supervised temporal sentence grounding. The proposed LCNet enjoys several merits. First, we represent video and text features in a hierarchical manner to model fine-grained video-text correspondences. Second, we design a self-supervised cycle-consistent loss as a learning guidance for video and text matching. To the best of our knowledge, this is the first work to fully explore the fine-grained correspondences between video and text for temporal sentence grounding by using self-supervised learning. Extensive experimental results on two benchmark datasets demonstrate that the proposed LCNet significantly outperforms existing weakly supervised methods.
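The abstract describes the cycle-consistent loss only at a high level. The PyTorch sketch below illustrates one common way such a loss can be instantiated for video-text matching without temporal annotations: each word attends to video frames, the attended video vectors attend back to the words, and a word is supervised to land back on itself. The function name, the temperature parameter, and the soft-attention formulation are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn.functional as F


def cycle_consistency_loss(video_feats, text_feats, temperature=0.1):
    """Hypothetical sketch of a self-supervised cycle-consistent matching loss.

    video_feats: (T, D) frame-level features
    text_feats:  (N, D) word-level features

    Forward pass: words attend over frames; backward pass: the attended
    video vectors attend over words. The cycle target (each word returns
    to its own index) needs no temporal boundary labels.
    """
    v = F.normalize(video_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)

    # forward: text -> video soft attention over frames
    attn_tv = F.softmax(t @ v.T / temperature, dim=-1)   # (N, T)
    v_hat = attn_tv @ v                                  # (N, D) attended video

    # backward: attended video -> text; word i should map back to word i
    logits_vt = v_hat @ t.T / temperature                # (N, N)
    target = torch.arange(t.size(0), device=t.device)
    return F.cross_entropy(logits_vt, target)


# usage with random features (T=64 frames, N=12 words, D=256 dims)
loss = cycle_consistency_loss(torch.randn(64, 256), torch.randn(12, 256))
loss.backward() if loss.requires_grad else None
```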