Activity-driven Weakly-Supervised Spatio-Temporal Grounding from Untrimmed Videos

Junwen Chen; Wentao Bao; Yu Xiang George Kong

doi:10.1145/3394171.3413614

JOURNAL ARTICLE

Year: 2020 Pages: 3789-3797

Abstract

In this paper, we study the problem of weakly-supervised spatio-temporal grounding from raw untrimmed video streams. Given a video and its descriptive sentence, spatio-temporal grounding aims at predicting the temporal occurrence and spatial locations of each query object across frames. Our goal is to learn a grounding model in a weakly-supervised fashion, without supervision from either spatial bounding boxes or temporal occurrences during training. Existing methods address trimmed videos, but their reliance on object tracking easily fails because of the frequent camera shot cuts in untrimmed videos. To this end, we propose a novel spatio-temporal multiple instance learning (MIL) framework for untrimmed video grounding. Spatial MIL and temporal MIL are mutually guided to ground each query to specific spatial regions and to the frames of the video in which it occurs. Furthermore, the activity described in the sentence is captured so that its informative contextual cues can be used for region proposal refinement and text representation. We conduct extensive evaluation on the YouCookII and RoboWatch datasets and demonstrate that our method outperforms state-of-the-art methods.
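The abstract does not give implementation details, so the following is only a minimal sketch of the general multiple instance learning idea it names, not the authors' model. It assumes precomputed region and query embeddings, cosine similarity as the matching score, max-pooling as the bag aggregator, and a hinge ranking loss; the function names `mil_video_score` and `ranking_loss`, the array shapes, and the margin value are all hypothetical choices made for illustration.

```python
import numpy as np


def mil_video_score(region_feats, query_feat):
    """Score one (video, query) pair by max-pooling over regions, then frames.

    region_feats: array of shape (T, R, D), T frames with R region proposals each.
    query_feat:   array of shape (D,), embedding of one query word.
    All inputs are assumed to be nonzero vectors (hypothetical precomputed features).
    """
    # Cosine similarity between the query and every region proposal.
    regions = region_feats / np.linalg.norm(region_feats, axis=-1, keepdims=True)
    query = query_feat / np.linalg.norm(query_feat)
    sim = regions @ query                      # shape (T, R)

    # Spatial MIL: each frame is a bag of regions; keep the best-matching region.
    best_region = sim.argmax(axis=1)           # shape (T,)
    frame_scores = sim.max(axis=1)             # shape (T,)

    # Temporal MIL: the video is a bag of frames; keep the best-matching frame.
    best_frame = int(frame_scores.argmax())
    return float(frame_scores[best_frame]), best_frame, int(best_region[best_frame])


def ranking_loss(pos_score, neg_score, margin=0.2):
    """Hinge loss that pushes a matched video-query pair above a mismatched one."""
    return max(0.0, margin - pos_score + neg_score)


if __name__ == "__main__":
    # Toy input: 3 frames, 2 regions per frame, 4-dimensional features.
    query = np.array([1.0, 0.0, 0.0, 0.0])
    feats = np.tile(np.array([0.0, 1.0, 0.0, 0.0]), (3, 2, 1))
    feats[1, 1] = query                        # plant the query object in frame 1, region 1
    score, frame, region = mil_video_score(feats, query)
    print(score, frame, region)
    print(ranking_loss(pos_score=score, neg_score=0.0))
```

Only video-sentence pairs are needed to train with such a loss, which is what makes the setting weakly supervised: the frame and region indices are recovered as a by-product of the pooling rather than from box or timestamp labels.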

Keywords:

Computer science; Artificial intelligence; Object (grammar); Bounding overwatch; Sentence; Representation (politics); Video tracking; Ground truth; Pattern recognition (psychology); Tracking (education); Computer vision; Machine learning

Metrics

FWCI (Field Weighted Citation Impact): 1.05

Citation Normalized Percentile: 0.79

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Human Pose and Action Recognition

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos

AutoLoc: Weakly-Supervised Temporal Action Localization in Untrimmed Videos

What, When, and Where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

Spatio-Temporal Activity Detection and Recognition in Untrimmed Surveillance Videos
