JOURNAL ARTICLE

Structural and Contrastive Guidance Mining for Weakly-Supervised Language Moment Localization

Dongjie Tang, Xiao‐Jie Cao

Year: 2024; Journal: IEEE Access; Vol: 12; Pages: 129290-129301; Publisher: Institute of Electrical and Electronics Engineers

Abstract

Weakly supervised temporal video grounding aims to localize the temporal moment (segment) that corresponds to a sentence query in a long, untrimmed video, using only video-level annotations. Because ground-truth moment annotations are unavailable, current methods suffer from several issues, including uncertainty about where events start and end and incomplete semantic matching with the sentence. We design our model to address these challenges. To reduce learning uncertainty and localize the moment more accurately, we compute a matching score curve between each video frame and the sentence query, and use this curve to create pseudo ground truth that supervises the localization network. To achieve complete matching with the sentence semantics, we propose a semantic prediction module based on matched video-sentence pairs and a semantic contrastive training strategy for unmatched pairs. Finally, to improve model accuracy, the semantic contrastive training strategy uses several constructed contrastive samples whose semantics are similar but not identical, which helps the model learn to distinguish different semantics and achieve complete semantic matching. We conduct extensive experiments on the Charades-STA, ActivityNet Captions, and DiDeMo datasets. The results show that the proposed method outperforms the state of the art by more than 10% in mean Intersection over Union (mIoU) for IoU thresholds ranging from 0.6 to 0.8, and by more than 30% at an IoU threshold of 0.7. The code is publicly available at https://github.com/anonymousabca/WLML.
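The abstract does not say how the matching score curve is computed or how pseudo ground truth is derived from it. As a rough illustration only, and not the authors' implementation, the sketch below assumes cosine similarity between per-frame features and a sentence embedding, and takes the contiguous run of frames around the peak whose scores stay above a relative threshold as the pseudo segment. The function names, the `ratio` parameter, and the thresholding rule are all hypothetical. A frame-level temporal IoU helper is included because the reported results are stated in terms of IoU.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def matching_curve(frame_feats, query_feat):
    """One score per frame: similarity of that frame to the sentence query (assumed form)."""
    return [cosine(f, query_feat) for f in frame_feats]


def pseudo_segment(curve, ratio=0.5):
    """Hypothetical pseudo-label rule: grow a segment outward from the peak
    while neighbouring scores stay at or above min + ratio * (max - min).
    Returns inclusive (start, end) frame indices."""
    lo, hi = min(curve), max(curve)
    thresh = lo + ratio * (hi - lo)
    peak = max(range(len(curve)), key=curve.__getitem__)
    start = end = peak
    while start > 0 and curve[start - 1] >= thresh:
        start -= 1
    while end < len(curve) - 1 and curve[end + 1] >= thresh:
        end += 1
    return start, end


def temporal_iou(a, b):
    """IoU of two inclusive frame-index segments (start, end)."""
    (s1, e1), (s2, e2) = a, b
    inter = max(0, min(e1, e2) - max(s1, s2) + 1)
    union = (e1 - s1 + 1) + (e2 - s2 + 1) - inter
    return inter / union
```

With toy two-dimensional features, `pseudo_segment(matching_curve(frames, query))` yields a `(start, end)` pair that could serve as a training target for a localization network, and `temporal_iou` compares such a pair with a predicted segment. A real system would use learned features and very likely a more careful pseudo-label rule.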

Keywords:
Computer science; Artificial intelligence; Natural language processing; Moment (physics); Physics

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 65
Citation Normalized Percentile: 0.12

Topics

Natural Language Processing Techniques
Physical Sciences →  Computer Science →  Artificial Intelligence
Speech and dialogue systems
Physical Sciences →  Computer Science →  Artificial Intelligence
Video Analysis and Summarization
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition