Videos are a rich source of multimodal information, encompassing visual, auditory, temporal, and motion data. This complexity presents both opportunities and challenges for learning robust representations during pre-training. Within the self-supervised learning paradigm, predictive techniques have shown great promise. In particular, masked modeling approaches, in which a model reconstructs missing or masked portions of its input, have proven highly effective at improving the quality of learned representations. In this thesis, we present MoSiamMAE, a Motion-Aware Siamese Masked Autoencoder that enhances video representation learning through efficient motion integration. Our model builds upon the successful VideoMAE architecture, introducing a dual-stream design that processes both spatial content and motion information derived from frame-wise RGB differences. MoSiamMAE employs a Siamese network structure with shared-weight encoders and a cross-attention decoder, enabling effective information propagation across the temporal dimension. We evaluate MoSiamMAE on the UCF-101 action recognition benchmark, where it compares favourably with VideoMAE: our model achieves 58.14% Top-1 accuracy versus VideoMAE's 55.74%, and reaches 59.82% Top-1 accuracy with the incorporation of an RGB difference loss. These results are obtained with a high masking ratio of 95%, highlighting the model's robustness. Our work contributes to the growing body of research on self-supervised video understanding, offering an efficient and effective approach to learning from both the spatial and temporal aspects of video data.
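To make the two ingredients mentioned above concrete, the short Python sketch below shows how frame-wise RGB differences can be computed as a lightweight motion cue and how a 95% random tube mask can be drawn. This is a minimal sketch, assuming a VideoMAE-style tube mask and standard (B, T, C, H, W) clip tensors; the function names (rgb_difference, random_tube_mask) and shapes are illustrative and not taken from the MoSiamMAE implementation.

    import torch

    def rgb_difference(frames):
        # frames: (B, T, C, H, W) clip; returns (B, T-1, C, H, W) differences
        # between consecutive frames, used here as a cheap motion cue.
        return frames[:, 1:] - frames[:, :-1]

    def random_tube_mask(num_patches_per_frame, num_frames, mask_ratio=0.95):
        # Draw one random spatial mask and repeat it over all frames
        # ("tube" masking, VideoMAE-style). True marks a masked patch.
        num_masked = int(num_patches_per_frame * mask_ratio)
        order = torch.rand(num_patches_per_frame).argsort()
        frame_mask = torch.zeros(num_patches_per_frame, dtype=torch.bool)
        frame_mask[order[:num_masked]] = True
        return frame_mask.unsqueeze(0).expand(num_frames, -1)

    # Example: a batch of two 16-frame 224x224 RGB clips with 16x16 patches.
    clip = torch.randn(2, 16, 3, 224, 224)
    motion = rgb_difference(clip)                    # (2, 15, 3, 224, 224)
    mask = random_tube_mask(14 * 14, num_frames=16)  # 95% of patches hidden

With a 14x14 patch grid, a 95% ratio leaves only about 10 visible patches per frame, which keeps pre-training inexpensive while forcing the encoder to rely on temporal context to reconstruct the masked content.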