JOURNAL ARTICLE

Multimodal contrastive learning for unsupervised video representation learning

Abstract

In this paper, we propose a multimodal unsupervised video learning algorithm designed to incorporate information from any number of modalities present in the data. We cooperatively train one network per modality: at each stage of training, one of these networks is selected and trained using the outputs of the others. To verify our algorithm, we train a model using RGB, optical flow, and audio. We then evaluate the learned representations by performing action classification and nearest neighbor retrieval on a labeled dataset. We compare this three-modality model to contrastive learning models using one or two modalities. Compared with using only RGB and optical flow, using all three modalities in tandem provides a 1.5% improvement in UCF101 classification accuracy, a 1.4% improvement in R@1 retrieval recall, a 3.5% improvement in R@5 retrieval recall, and a 2.4% improvement in R@10 retrieval recall, demonstrating the merit of using as many modalities as possible in a cooperative learning model.
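The R@k retrieval recall figures quoted in the abstract can be illustrated with a minimal sketch. It assumes the common convention that a query counts as a hit when at least one of its k nearest gallery items (by cosine similarity) shares its class label; the abstract does not spell out the exact protocol, so the function name `recall_at_k`, its arguments, and the similarity measure are illustrative assumptions rather than the authors' implementation.

```python
import math


def _normalize(vec):
    """Scale a feature vector to unit length (assumes a nonzero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def recall_at_k(queries, query_labels, gallery, gallery_labels, k):
    """Fraction of queries with a same-class item among their k nearest gallery items.

    Hypothetical helper: the hit criterion and cosine similarity are assumptions.
    """
    unit_gallery = [_normalize(v) for v in gallery]
    hits = 0
    for feat, label in zip(queries, query_labels):
        q = _normalize(feat)
        # Cosine similarity between the query and every gallery item.
        sims = [sum(a * b for a, b in zip(q, g)) for g in unit_gallery]
        # Gallery indices sorted from most to least similar.
        ranked = sorted(range(len(unit_gallery)), key=lambda i: sims[i], reverse=True)
        if any(gallery_labels[i] == label for i in ranked[:k]):
            hits += 1
    return hits / len(queries)
```

In a setup like the one described, the query and gallery features would come from the trained network applied to held-out and training clips respectively, and the function would be called once each for k = 1, 5, and 10.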

Keywords:
Computer science, Artificial intelligence, Modalities, Modality (human–computer interaction), Recall, Feature learning, Unsupervised learning, Deep learning, Machine learning, Optical flow, Multimodal learning, RGB color model, Representation (politics), Pattern recognition (psychology), Image (mathematics)

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 39
Citation Normalized Percentile: 0.01

Topics

Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
