Self-Supervised Learning from Untrimmed Videos via Hierarchical Consistency

Zhiwu Qing; Shiwei Zhang; Ziyuan Huang; Yi Xu; Xiang Wang; Changxin Gao; Rong Jin; Nong Sang

doi:10.1109/tpami.2023.3273415

JOURNAL ARTICLE

Year: 2023; Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence; Vol: 45 (10); Pages: 12408-12426; Publisher: IEEE Computer Society

Abstract

Natural untrimmed videos provide rich visual content for self-supervised learning. Yet most previous efforts to learn spatio-temporal representations rely on manually trimmed videos, such as the Kinetics dataset (Carreira and Zisserman 2017), which limits both the diversity of visual patterns and the performance gains. In this work, we aim to improve video representations by leveraging the rich information in natural untrimmed videos. For this purpose, we propose learning a hierarchy of temporal consistencies in videos, i.e., visual consistency and topical consistency. Visual consistency applies to clip pairs that tend to be visually similar when separated by a short time span; topical consistency applies to clip pairs that share similar topics when separated by a long time span. Specifically, we present a Hierarchical Consistency (HiCo++) learning framework, in which visually consistent pairs are encouraged to share the same feature representations by contrastive learning, while topically consistent pairs are coupled through a topical classifier that distinguishes whether they are topic-related, i.e., from the same untrimmed video. Additionally, we impose a gradual sampling algorithm for the proposed hierarchical consistency learning, and demonstrate its theoretical superiority. Empirically, we show that HiCo++ can not only generate stronger representations on untrimmed videos, but also improve the representation quality when applied to trimmed videos. This contrasts with standard contrastive learning, which fails to learn powerful representations from untrimmed videos. Source code will be made available here.
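The two consistency levels described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the linear gap schedule, the InfoNCE form of the contrastive loss, and the binary cross-entropy form of the topical loss are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def sample_clip_pair(num_frames, clip_len, max_gap, rng):
    """Pick start frames for two clips from one video.

    The two starts differ by at most `max_gap` frames, so a small
    `max_gap` gives short-range (visually similar) pairs and a large
    one gives long-range (topically related) pairs.
    """
    last_start = num_frames - clip_len
    first = int(rng.integers(0, last_start + 1))
    low = max(0, first - max_gap)
    high = min(last_start, first + max_gap)
    second = int(rng.integers(low, high + 1))
    return first, second


def gradual_max_gap(step, total_steps, start_gap, end_gap):
    """Widen the allowed gap linearly over training (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return int(round(start_gap + frac * (end_gap - start_gap)))


def info_nce(anchors, positives, temperature=0.1):
    """Contrastive loss: row i of `anchors` should match row i of `positives`.

    Other rows in the batch act as negatives (assumed InfoNCE form).
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def topical_bce(pair_logits, same_video):
    """Binary cross-entropy for "do these two clips share a video?".

    `pair_logits` are raw classifier scores; `same_video` holds 0/1 labels.
    """
    z = np.asarray(pair_logits, dtype=float)
    y = np.asarray(same_video, dtype=float)
    # Stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))].
    return float(np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))))
```

In this sketch, a training loop would call `gradual_max_gap` at each step, draw clip pairs with `sample_clip_pair`, apply `info_nce` to features of short-range pairs, and apply `topical_bce` to classifier scores for long-range pairs drawn from the same or different videos. How the two losses are weighted is left open here because the abstract does not say.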

Keywords:

Artificial intelligence; Computer science; Consistency (knowledge bases); Feature learning; Machine learning; Classifier (UML); Pattern recognition (psychology); Visualization

Metrics

FWCI (Field Weighted Citation Impact): 0.55

Refs: 144

Citation Normalized Percentile: 0.59

Topics

Human Pose and Action Recognition

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Cancer-related molecular mechanisms research

Life Sciences → Biochemistry, Genetics and Molecular Biology → Cancer Research
