WINNER: Weakly-supervised hIerarchical decompositioN and aligNment for spatio-tEmporal video gRounding

Mengze Li; Han Wang; Wenqiao Zhang; Jiaxu Miao; Zhou Zhao; Shengyu Zhang; Wei Ji; Fei Wu

doi:10.1109/cvpr52729.2023.02211

Year: 2023; Pages: 23090-23099

Abstract

Spatio-temporal video grounding aims to localize the aligned visual tube corresponding to a language query. Existing techniques achieve such alignment by exploiting dense boundary and bounding box annotations, which can be prohibitively expensive. To bridge the gap, we investigate the weakly-supervised setting, where models learn from easily accessible video-language data without annotations. We identify that intra-sample spurious correlations among video-language components can be alleviated if the model captures the decomposed structures of video and language data. In this light, we propose a novel framework, namely WINNER, for hierarchical video-text understanding. WINNER first builds the language decomposition tree in a bottom-up manner, upon which the structural attention mechanism and top-down feature backtracking jointly build a multi-modal decomposition tree, permitting a hierarchical understanding of unstructured videos. The multi-modal decomposition tree serves as the basis for multi-hierarchy language-tube matching. A hierarchical contrastive learning objective is proposed to learn the multi-hierarchy correspondence and distinction with intra-sample and inter-sample video-text decomposition structures, achieving video-language decomposition structure alignment. Extensive experiments demonstrate the rationality of our design and show that it outperforms state-of-the-art weakly supervised methods and even some supervised methods.
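The abstract does not state the exact form of the hierarchical contrastive objective. As an illustration only, the sketch below assumes an InfoNCE-style loss applied independently at each level of the decomposition tree and then combined as a weighted sum. The function names (`info_nce`, `hierarchical_contrastive_loss`), the temperature `tau`, the per-level `weights`, and the use of precomputed text-tube similarity matrices are all assumptions made for this example, not details taken from the paper.

```python
import math


def info_nce(sim, tau=0.1):
    """Mean InfoNCE loss over a square similarity matrix.

    sim[i][j] is the similarity between text node i and video tube j;
    the matched (positive) pair for row i is assumed to be sim[i][i].
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / tau for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]  # -log softmax of the positive
    return total / len(sim)


def hierarchical_contrastive_loss(sims_per_level, weights=None, tau=0.1):
    """Weighted sum of per-level InfoNCE losses (hypothetical combination rule)."""
    if weights is None:
        weights = [1.0] * len(sims_per_level)
    return sum(w * info_nce(sim, tau) for w, sim in zip(weights, sims_per_level))


# Toy similarity matrices for two hierarchy levels (made-up numbers).
aligned = [[1.0, 0.0], [0.0, 1.0]]    # positives clearly stand out
confused = [[0.0, 0.0], [0.0, 0.0]]   # positives indistinguishable
loss = hierarchical_contrastive_loss([aligned, confused])
```

In this toy setup the well-aligned level contributes almost nothing to the loss, while the level whose positives are indistinguishable from negatives contributes about log 2, which is the behaviour a contrastive objective is expected to show.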

Keywords:

Computer science; Artificial intelligence; Hierarchy; Tree (set theory); Sample (material); Tree structure; Language model; Machine learning; Pattern recognition (psychology); Algorithm; Binary tree

Metrics

FWCI (Field Weighted Citation Impact): 6.19

Citation Normalized Percentile: 0.96

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Human Pose and Action Recognition

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Weakly-Supervised Spatio-Temporal Video Grounding with Variational Cross-Modal Alignment

Video-Text Prompting for Weakly Supervised Spatio-Temporal Video Grounding

Weakly-Supervised Video Object Grounding by Exploring Spatio-Temporal Contexts

Activity-driven Weakly-Supervised Spatio-Temporal Grounding from Untrimmed Videos

STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding