Weakly supervised temporal video grounding aims to localize the temporal moment or segment corresponding to a sentence query in an untrimmed, long video using only video-level annotations. Due to the lack of ground-truth moment annotations, existing methods suffer from several issues, such as uncertainty about event start/end points and incomplete semantic matching with the sentence. We design our model to address these challenges. To reduce learning uncertainty and localize moments more accurately, we compute a matching score curve between each video frame and the sentence query, and use this curve to generate pseudo ground truth that supervises the localization network. To achieve complete semantic matching with the sentence, we propose a semantic prediction module trained on matched video-sentence pairs and a semantic contrastive training strategy for unmatched pairs. Finally, to further improve accuracy, the contrastive training strategy constructs several contrastive samples that share similar but distinct semantics, which helps the model discriminate between semantics and achieve complete semantic matching. We conduct extensive experiments on the Charades-STA, ActivityNet Captions, and DiDeMo datasets. The results demonstrate that our proposed method significantly outperforms the state-of-the-art, by more than 10% when the Intersection over Union (IoU) threshold ranges from 0.6 to 0.8, and by more than 30% when the IoU threshold equals 0.7. The code is publicly available at https://github.com/anonymousabca/WLML.
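As an illustration of turning a frame-query matching score curve into pseudo ground truth, the step could be sketched as follows. This is a minimal sketch under our own assumptions (a relative `ratio` threshold and a single contiguous high-score region), not the paper's actual implementation:

```python
import numpy as np

def pseudo_boundaries(scores, ratio=0.5):
    """Derive a pseudo ground-truth segment from a per-frame matching
    score curve by keeping frames whose score exceeds a fraction of the
    curve's maximum (both the ratio and the contiguity assumption are
    illustrative, not taken from the paper).

    Returns (start_frame, end_frame) indices.
    """
    scores = np.asarray(scores, dtype=float)
    thr = ratio * scores.max()              # relative threshold (assumption)
    above = np.flatnonzero(scores >= thr)   # indices of high-scoring frames
    return int(above[0]), int(above[-1])    # span covering all kept frames

# Example: frames 2-4 score highest, so they form the pseudo segment.
start, end = pseudo_boundaries([0.1, 0.2, 0.8, 0.9, 0.7, 0.2])
```

The resulting (start, end) pair can then serve as a supervision target for the localization network in place of the missing human annotation.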
Minghang Zheng, Yanjie Huang, Qing-Chao Chen, Yang Liu
Meng Liu, Yupeng Hu, Weili Guan, Liqiang Nie
Yenan Xu, Wanru Xu, Zhenjiang Miao
Tingting Han, Kai Wang, Jun Yu, Jianping Fan