DISSERTATION

Motion-Aware Siamese Masked Autoencoders for Efficient Video Representation Learning

Abstract

Videos are a rich source of multimodal information, encompassing visual, auditory, temporal, and motion data. This complexity presents both opportunities and challenges for learning robust representations during pre-training. Within the self-supervised learning paradigm, predictive techniques have shown great promise. In particular, masked modeling approaches, in which a model reconstructs masked portions of its input, have proven highly effective at improving the quality of learned representations. In this thesis, we present MoSiamMAE, a Motion-Aware Siamese Masked Autoencoder that improves video representation learning through efficient motion integration. Our model builds on the VideoMAE architecture and introduces a dual-stream design that processes both spatial content and motion information derived from frame-wise RGB differences. MoSiamMAE employs a Siamese network structure with shared-weight encoders and a cross-attention decoder, enabling information to propagate across the temporal dimension. We evaluate MoSiamMAE on the UCF-101 action recognition benchmark, where its performance is competitive with VideoMAE: our model achieves 58.14% Top-1 accuracy compared to VideoMAE's 55.74%, and with the incorporation of an RGB difference loss it reaches 59.82% Top-1 accuracy. These results are obtained with a high masking ratio of 95%, highlighting the model's robustness. Our work contributes to the growing body of research on self-supervised video understanding, offering an efficient and effective approach to learning from both the spatial and temporal aspects of video data.
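The two input-side ingredients named in the abstract, frame-wise RGB differences and a 95% masking ratio, can be illustrated with a minimal NumPy sketch. It is not the thesis implementation. The helper names `rgb_differences` and `random_mask`, the toy clip size, and the 2x16x16 tubelet tokenization of a 16x224x224 clip are all assumptions made for illustration, and uniform random masking is used as a stand-in because the abstract does not state the masking strategy.

```python
import numpy as np


def rgb_differences(clip: np.ndarray) -> np.ndarray:
    """Frame-wise RGB differences: diff[t] = clip[t + 1] - clip[t].

    clip has shape (T, H, W, 3); the result has shape (T - 1, H, W, 3).
    """
    clip = clip.astype(np.float32)  # avoid uint8 wrap-around on subtraction
    return clip[1:] - clip[:-1]


def random_mask(num_tokens: int, mask_ratio: float,
                rng: np.random.Generator) -> np.ndarray:
    """Boolean mask over tokens; True marks a masked (hidden) token."""
    num_masked = int(round(num_tokens * mask_ratio))
    mask = np.zeros(num_tokens, dtype=bool)
    masked_idx = rng.choice(num_tokens, size=num_masked, replace=False)
    mask[masked_idx] = True
    return mask


rng = np.random.default_rng(0)

# Toy clip (assumed size): 16 frames of 32x32 RGB pixels.
clip = rng.integers(0, 256, size=(16, 32, 32, 3), dtype=np.uint8)
motion = rgb_differences(clip)  # motion input, one frame shorter than the clip

# Assumed tokenization: 2x16x16 tubelets over a 16x224x224 clip.
num_tokens = (16 // 2) * (224 // 16) * (224 // 16)
mask = random_mask(num_tokens, 0.95, rng)
num_visible = int((~mask).sum())  # tokens an encoder would actually process
```

Under these assumed sizes, only a small fraction of the tokens remains visible to the encoder, which is where the efficiency of high-ratio masked pre-training comes from.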

Keywords:
Autoencoder; Feature learning; Representation; RGB color model; Motion; Masking; Encoder; Deep learning; Trajectory

Topics

Human Pose and Action Recognition
Multimodal Machine Learning Applications
Generative Adversarial Networks and Image Synthesis
Field (all topics): Physical Sciences → Computer Science → Computer Vision and Pattern Recognition