Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos

Jie Wu; Guanbin Li; Xiaoguang Han; Liang Lin

doi:10.1145/3394171.3413862


JOURNAL ARTICLE


Year: 2020 Pages: 1283-1291

DOI: 10.1145/3394171.3413862


Abstract

Temporal grounding of natural language in untrimmed videos is a fundamental yet challenging multimedia task that facilitates cross-media visual content retrieval. We focus on the weakly supervised setting of this task, which has access only to coarse video-level language descriptions without temporal boundary annotations; this setting is more realistic because such weak labels are more readily available in practice. In this paper, we propose a Boundary Adaptive Refinement (BAR) framework that uses reinforcement learning (RL) to guide the progressive refinement of the temporal boundary. To the best of our knowledge, this is the first attempt to extend RL to the temporal localization task with weak supervision. Because a straightforward reward function is hard to obtain without fine-grained pairwise boundary-query annotations, a cross-modal alignment evaluator is designed to measure the degree of alignment of each segment-query pair and to provide tailored rewards. This refinement scheme abandons the traditional sliding-window-based approach entirely and yields more efficient, boundary-flexible, and content-aware grounding results. Extensive experiments on two public benchmarks, Charades-STA and ActivityNet, demonstrate that BAR outperforms the state-of-the-art weakly supervised method and even beats some competitive fully supervised ones.
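The refinement idea in the abstract can be illustrated with a small toy sketch. This is not the BAR implementation: the action set, the cosine-similarity evaluator, and the function names (`alignment_score`, `apply_action`, `refine`) are all assumptions made for illustration, and a greedy hill-climbing loop stands in for the learned RL policy. What the sketch does show is the shape of the reward signal: each boundary adjustment is rewarded by the change in a segment-query alignment score, so no ground-truth temporal boundary is needed.

```python
import numpy as np

# Hypothetical action set: each action nudges (start, end) by one clip.
ACTIONS = {
    "shift_left": (-1, -1),
    "shift_right": (1, 1),
    "extend_start": (-1, 0),
    "shrink_start": (1, 0),
    "shrink_end": (0, -1),
    "extend_end": (0, 1),
}


def alignment_score(clip_feats, query_feat, start, end):
    """Toy stand-in for a cross-modal alignment evaluator: cosine
    similarity between the mean-pooled segment and the query feature."""
    segment = clip_feats[start:end + 1].mean(axis=0)
    denom = np.linalg.norm(segment) * np.linalg.norm(query_feat) + 1e-8
    return float(segment @ query_feat / denom)


def apply_action(start, end, action, num_clips):
    """Return the adjusted boundary, or the old one if the move is invalid."""
    d_start, d_end = ACTIONS[action]
    new_start, new_end = start + d_start, end + d_end
    if new_start < 0 or new_end >= num_clips or new_start > new_end:
        return start, end
    return new_start, new_end


def refine(clip_feats, query_feat, start, end, max_steps=20):
    """Greedy placeholder for a learned policy: at each step take the
    action with the largest reward (gain in alignment score) and stop
    when no action yields a positive reward."""
    num_clips = len(clip_feats)
    score = alignment_score(clip_feats, query_feat, start, end)
    for _ in range(max_steps):
        candidates = []
        for action in ACTIONS:
            s, e = apply_action(start, end, action, num_clips)
            new_score = alignment_score(clip_feats, query_feat, s, e)
            candidates.append((new_score - score, new_score, s, e))
        reward, new_score, s, e = max(candidates, key=lambda c: c[0])
        if reward <= 0:
            break
        start, end, score = s, e, new_score
    return start, end, score


# Synthetic example: 12 clips, of which clips 4..7 match the query direction.
clip_feats = np.tile([0.0, 1.0], (12, 1))
clip_feats[4:8] = [1.0, 0.0]
query_feat = np.array([1.0, 0.0])

start, end, score = refine(clip_feats, query_feat, start=2, end=9)
print(start, end, round(score, 3))
```

Starting from a loose initial segment, the loop tightens the boundary one clip at a time instead of scoring a fixed grid of sliding windows. In an RL setting the greedy `max` would be replaced by a policy trained on the same reward.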

Keywords:

Reinforcement learning; Computer science; Pairwise comparison; Task (project management); Artificial intelligence; Boundary (topology); Natural language; Natural language processing; Supervised learning; Measure (data warehouse); Annotation; Machine learning; Data mining; Mathematics

Metrics

FWCI (Field Weighted Citation Impact): 4.30

Citation Normalized Percentile: 0.95

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Human Pose and Action Recognition

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Video Analysis and Summarization

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition


Related Documents

Activity-driven Weakly-Supervised Spatio-Temporal Grounding from Untrimmed Videos

AutoLoc: Weakly-Supervised Temporal Action Localization in Untrimmed Videos

Weakly Supervised Temporal Adjacent Network for Language Grounding

WOAD: Weakly Supervised Online Action Detection in Untrimmed Videos

Weakly-Supervised Temporal Article Grounding