Abstract

Given a long untrimmed video and natural language queries, video grounding (VG) aims to temporally localize the semantically aligned video segments. Almost all existing VG work makes two simple but unrealistic assumptions: 1) all query sentences can be grounded in the corresponding video; 2) all query sentences for the same video are at the same semantic scale. Unfortunately, both assumptions cause today's VG models to fail in practice. For example, in real-world multimodal assets (e.g., news articles), most sentences in an article cannot be grounded in the accompanying videos, and the sentences typically have rich hierarchical relations (i.e., they sit at different semantic scales). To this end, we propose a new and challenging grounding task: Weakly-Supervised temporal Article Grounding (WSAG). Specifically, given an article and a relevant video, WSAG aims to localize all “groundable” sentences to the video, where these sentences may be at different semantic scales. Accordingly, we collect the first WSAG dataset to facilitate this task: YouwikiHow, which borrows the inherent multi-scale descriptions in wikiHow articles and plentiful YouTube videos. In addition, we propose a simple but effective method for WSAG, DualMIL, which consists of a two-level MIL loss and a single-/cross-sentence constraint loss. These training objectives are carefully designed for the relaxed assumptions. Extensive ablations verify the effectiveness of DualMIL.

Keywords:
Chen, Computer science, Artificial intelligence, Natural language processing, Cognitive science, Geology, Psychology, Paleontology

Metrics

Cited By: 11
FWCI (Field Weighted Citation Impact): 2.15
Refs: 50
Citation Normalized Percentile: 0.85
