We propose a self-supervised method to learn feature representations from videos. A standard approach in traditional self-supervised methods uses positive-negative data pairs to train with a contrastive learning strategy. In such a case, different modalities of the same video are treated as positives, and video clips from different videos are treated as negatives. Because spatio-temporal information is important for video representation, we extend the negative samples by introducing intra-negative samples, which are transformed from the same anchor video by breaking the temporal relations in its video clips. With the proposed Inter-Intra Contrastive (IIC) framework, we can train spatio-temporal convolutional networks to learn video representations. Our IIC framework offers many flexible options, and we conduct experiments with several different configurations. The learned video representations are evaluated on video retrieval and video recognition tasks. Our proposed IIC outperforms current state-of-the-art results by a large margin, such as 16.7 and 9.5 percentage-point improvements in top-1 accuracy for video retrieval on the UCF101 and HMDB51 datasets, respectively. For video recognition, improvements can also be obtained on these two benchmark datasets. Code is available at https://github.com/BestJuly/Inter-intra-video-contrastive-learning.
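To make the inter-intra idea concrete, below is a minimal PyTorch-style sketch of the two ingredients the abstract describes: an intra-negative generated from the anchor clip by breaking its temporal order (frame shuffling is one such transformation), and an InfoNCE-style loss whose negative set contains both clips from other videos (inter-negatives) and the temporally broken anchor (intra-negatives). The function names `make_intra_negative` and `inter_intra_nce_loss`, the shuffling choice, and the temperature value are illustrative assumptions, not the repository's exact implementation.

```python
import torch
import torch.nn.functional as F

def make_intra_negative(clip: torch.Tensor) -> torch.Tensor:
    """Create an intra-negative by breaking temporal relations.

    clip: tensor of shape (C, T, H, W). Shuffling frames keeps the
    appearance statistics of the anchor but destroys its temporal order.
    """
    perm = torch.randperm(clip.shape[1])
    return clip[:, perm]

def inter_intra_nce_loss(anchor: torch.Tensor,
                         positive: torch.Tensor,
                         inter_negs: torch.Tensor,
                         intra_negs: torch.Tensor,
                         tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss with both inter- and intra-negatives.

    anchor, positive: (D,) embeddings of two views/modalities of one video.
    inter_negs: (N, D) embeddings of clips from other videos.
    intra_negs: (M, D) embeddings of temporally shuffled anchor clips.
    """
    anchor = F.normalize(anchor, dim=0)
    positive = F.normalize(positive, dim=0)
    negs = F.normalize(torch.cat([inter_negs, intra_negs], dim=0), dim=1)

    pos_logit = (anchor @ positive / tau).unsqueeze(0)        # (1,)
    neg_logits = negs @ anchor / tau                          # (N+M,)
    logits = torch.cat([pos_logit, neg_logits]).unsqueeze(0)  # (1, 1+N+M)
    target = torch.zeros(1, dtype=torch.long)                 # positive sits at index 0
    return F.cross_entropy(logits, target)
```

Under this sketch, the network is pushed to distinguish the anchor not only from other videos but also from its own shuffled copy, which cannot be done from appearance alone and therefore encourages genuinely temporal features.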
Tao Li, Xueting Wang, Toshihiko Yamasaki