Nartay Aikyn, Assanali Abu, Tomiris Zhaksylyk, Nguyen Anh Tu
This paper proposes a novel few-shot action recognition framework that integrates a Transformer-based feature backbone into meta-learning. The proposed method pre-trains a Video Transformer and then applies metric-based meta-learning with the ProtoNet algorithm. Extensive experiments on benchmark datasets demonstrate that our approach surpasses baseline models and obtains competitive results against state-of-the-art models. Additionally, we investigate the impact of supervised and self-supervised learning on video representation and evaluate the transferability of the learned representations in cross-domain scenarios. Our approach suggests a promising direction for combining meta-learning with Video Transformers in few-shot learning tasks, potentially contributing to action recognition across various domains.
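The ProtoNet step described above classifies a query by its distance to class prototypes, where each prototype is the mean embedding of that class's support examples. A minimal sketch of this idea follows; it assumes embeddings have already been produced by a backbone such as the Video Transformer, and the function name `protonet_classify` is illustrative, not from the paper.

```python
import numpy as np

def protonet_classify(support, support_labels, queries):
    """Nearest-prototype classification (Prototypical Networks style).

    support: (N, D) support-set embeddings (e.g. from a video backbone)
    support_labels: (N,) integer class labels for the support set
    queries: (M, D) query embeddings to classify
    Returns an (M,) array of predicted labels.
    """
    classes = np.unique(support_labels)
    # Prototype = mean embedding of each class's support examples
    prototypes = np.stack(
        [support[support_labels == c].mean(axis=0) for c in classes]
    )
    # Squared Euclidean distance from every query to every prototype
    dists = ((queries[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    # Assign each query to its nearest prototype's class
    return classes[dists.argmin(axis=1)]
```

In a 5-way 1-shot episode, `support` would hold one embedding per class and `queries` the held-out clips; training updates the backbone so that this nearest-prototype rule succeeds on sampled episodes.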
Yutao Tang, Benjamín Béjar, René Vidal
Yongqiang Kong, Yunhong Wang, Annan Li
SuBeen Lee, WonJun Moon, Hyun Seok Seong, Jae-Pil Heo
Nguyen Anh Tu, Nartay Aikyn, Nursultan Makhanov, Assanali Abu, Kok-Seng Wong, Min-Ho Lee