Videos are a rich source of multimodal information, encompassing visual, auditory, temporal, and motion data. This complexity presents both opportunities and challenges for learning robust representations during pre-training. Within the self-supervised learning paradigm, predictive techniques have shown great promise. In particular, masked modeling approaches, in which a model reconstructs missing or masked portions of its input, have proven highly effective at improving the quality of learned representations. In this thesis, we present MoSiamMAE, a Motion-Aware Siamese Masked Autoencoder that enhances video representation learning through efficient motion integration. Our model builds upon the successful VideoMAE architecture, introducing a dual-stream design that processes both spatial content and motion information derived from frame-wise RGB differences. MoSiamMAE employs a Siamese network structure with shared-weight encoders and a cross-attention decoder, enabling effective information propagation across the temporal dimension. We evaluate MoSiamMAE on the UCF-101 action recognition benchmark, where it compares favourably with VideoMAE: our model achieves 58.14% Top-1 accuracy versus VideoMAE's 55.74%, and reaches 59.82% Top-1 accuracy with the incorporation of an RGB difference loss. These results are obtained with a high masking ratio of 95%, highlighting the model's robustness. Our work contributes to the growing body of research on self-supervised video understanding, offering an efficient and effective approach to learning from both the spatial and temporal aspects of video data.
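To make the two ingredients mentioned above concrete, the short Python sketch below shows how frame-wise RGB differences can be computed as a lightweight motion cue and how a 95% random tube mask can be drawn. This is a minimal sketch, assuming a VideoMAE-style tube mask and standard (B, T, C, H, W) clip tensors; the function names (rgb_difference, random_tube_mask) and shapes are illustrative and not taken from the MoSiamMAE implementation.

    import torch

    def rgb_difference(frames):
        # frames: (B, T, C, H, W) clip; returns (B, T-1, C, H, W) differences
        # between consecutive frames, used here as a cheap motion cue.
        return frames[:, 1:] - frames[:, :-1]

    def random_tube_mask(num_patches_per_frame, num_frames, mask_ratio=0.95):
        # Draw one random spatial mask and repeat it over all frames
        # ("tube" masking, VideoMAE-style). True marks a masked patch.
        num_masked = int(num_patches_per_frame * mask_ratio)
        order = torch.rand(num_patches_per_frame).argsort()
        frame_mask = torch.zeros(num_patches_per_frame, dtype=torch.bool)
        frame_mask[order[:num_masked]] = True
        return frame_mask.unsqueeze(0).expand(num_frames, -1)

    # Example: a batch of two 16-frame 224x224 RGB clips with 16x16 patches.
    clip = torch.randn(2, 16, 3, 224, 224)
    motion = rgb_difference(clip)                    # (2, 15, 3, 224, 224)
    mask = random_tube_mask(14 * 14, num_frames=16)  # 95% of patches hidden

With a 14x14 patch grid, a 95% ratio leaves only about 10 visible patches per frame, which keeps pre-training inexpensive while forcing the encoder to rely on temporal context to reconstruct the masked content.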