Abstract

We introduce S2VS, a self-supervised approach to video similarity learning. Self-Supervised Learning (SSL) is typically used to train deep models on a proxy task so that they transfer well to target tasks after fine-tuning. Here, in contrast to prior work, SSL is used to perform video similarity learning directly and to address multiple retrieval and detection tasks at once without any labeled data. This is achieved by learning via instance discrimination with task-tailored augmentations, using the widely used InfoNCE loss together with an additional loss that operates jointly on self-similarity and hard-negative similarity. We benchmark our method on tasks where video relevance is defined at varying granularity, ranging from video copies to videos depicting the same incident or event. We learn a single universal model that achieves state-of-the-art performance on all tasks, surpassing previously proposed methods that use labeled data. The code and pretrained models are publicly available at: https://github.com/gkordo/s2vs
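As a rough illustration of the instance-discrimination objective named in the abstract, the following is a minimal sketch of a generic InfoNCE loss in plain Python. It is not the authors' implementation: the function names, the use of cosine similarity, and the temperature value of 0.07 are assumptions made for this example, and the additional loss on self-similarity and hard-negative similarity is not covered.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss over a batch of embedding pairs.

    positives[i] is the embedding of an augmented view of the item behind
    anchors[i]; every other positives[j] acts as a negative for anchors[i].
    The temperature default is an assumed value, not taken from the paper.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [
            cosine_similarity(anchors[i], positives[j]) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching pair as the correct class.
        total += log_sum_exp - logits[i]
    return total / n


# Hypothetical 2-D embeddings, for illustration only.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(anchors, [[0.9, 0.1], [0.1, 0.9]])
shuffled = info_nce_loss(anchors, [[0.1, 0.9], [0.9, 0.1]])
print(aligned < shuffled)
```

The loss is small when each anchor is closest to its own augmented view and grows when another item in the batch is more similar, which is the behaviour instance discrimination relies on.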

Keywords:
Computer science; Similarity (geometry); Artificial intelligence; Benchmark (surveying); Granularity; Task (project management); Machine learning; Relevance (law); Source code; Code (set theory); Labeled data; Ranging; Image (mathematics)

Metrics

Cited By: 17
FWCI (Field Weighted Citation Impact): 4.34
Refs: 105
Citation Normalized Percentile: 0.93

Topics

Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence
Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
