We propose a self-supervised method to learn feature representations from videos. A standard approach in traditional self-supervised methods uses positive-negative data pairs to train with a contrastive learning strategy. In such a case, different modalities of the same video are treated as positives, and video clips from different videos are treated as negatives. Because spatio-temporal information is important for video representation, we extend the negative samples by introducing intra-negative samples, which are transformed from the same anchor video by breaking the temporal relations in its video clips. With the proposed Inter-Intra Contrastive (IIC) framework, we can train spatio-temporal convolutional networks to learn video representations. Our IIC framework offers many flexible options, and we conduct experiments with several different configurations. The learned video representations are evaluated on video retrieval and video recognition tasks. Our proposed IIC outperforms current state-of-the-art results by a large margin, such as 16.7 and 9.5 percentage-point improvements in top-1 accuracy for video retrieval on the UCF101 and HMDB51 datasets, respectively. For video recognition, improvements can also be obtained on these two benchmark datasets. Code is available at https://github.com/BestJuly/Inter-intra-video-contrastive-learning.
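To make the inter-intra idea concrete, below is a minimal PyTorch-style sketch of the two ingredients the abstract describes: an intra-negative generated from the anchor clip by breaking its temporal order (frame shuffling is one such transformation), and an InfoNCE-style loss whose negative set contains both clips from other videos (inter-negatives) and the temporally broken anchor (intra-negatives). The function names `make_intra_negative` and `inter_intra_nce_loss`, the shuffling choice, and the temperature value are illustrative assumptions, not the repository's exact implementation.

```python
import torch
import torch.nn.functional as F

def make_intra_negative(clip: torch.Tensor) -> torch.Tensor:
    """Create an intra-negative by breaking temporal relations.

    clip: tensor of shape (C, T, H, W). Shuffling frames keeps the
    appearance statistics of the anchor but destroys its temporal order.
    """
    perm = torch.randperm(clip.shape[1])
    return clip[:, perm]

def inter_intra_nce_loss(anchor: torch.Tensor,
                         positive: torch.Tensor,
                         inter_negs: torch.Tensor,
                         intra_negs: torch.Tensor,
                         tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss with both inter- and intra-negatives.

    anchor, positive: (D,) embeddings of two views/modalities of one video.
    inter_negs: (N, D) embeddings of clips from other videos.
    intra_negs: (M, D) embeddings of temporally shuffled anchor clips.
    """
    anchor = F.normalize(anchor, dim=0)
    positive = F.normalize(positive, dim=0)
    negs = F.normalize(torch.cat([inter_negs, intra_negs], dim=0), dim=1)

    pos_logit = (anchor @ positive / tau).unsqueeze(0)        # (1,)
    neg_logits = negs @ anchor / tau                          # (N+M,)
    logits = torch.cat([pos_logit, neg_logits]).unsqueeze(0)  # (1, 1+N+M)
    target = torch.zeros(1, dtype=torch.long)                 # positive sits at index 0
    return F.cross_entropy(logits, target)
```

Under this sketch, the network is pushed to distinguish the anchor not only from other videos but also from its own shuffled copy, which cannot be done from appearance alone and therefore encourages genuinely temporal features.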
Tao Li, Xueting Wang, Toshihiko Yamasaki