JOURNAL ARTICLE

Few-shot Action Recognition with Video Transformer

Abstract

This paper proposes a novel few-shot action recognition framework that integrates a Transformer-based feature backbone into meta-learning. The proposed method pre-trains a Video Transformer and then applies metric-based meta-learning with the ProtoNet algorithm. Extensive experiments on benchmark datasets show that the approach surpasses baseline models and achieves results competitive with state-of-the-art methods. Additionally, we investigate the impact of supervised versus self-supervised learning on video representation and evaluate the transferability of the learned representations in cross-domain scenarios. Our approach suggests a promising direction for combining meta-learning with Video Transformers in few-shot learning tasks, potentially contributing to action recognition across various domains.
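The metric-based meta-learning step the abstract mentions can be illustrated with the core ProtoNet classification rule: average each class's support embeddings into a prototype, then assign each query to the nearest prototype. The following is a minimal NumPy sketch under stated assumptions; the embeddings are stand-ins for features a pre-trained Video Transformer backbone would produce, and all names and dimensions are illustrative, not taken from the paper.

```python
import numpy as np

def prototypical_classify(support, support_labels, queries, n_way):
    """ProtoNet-style few-shot classification.

    support: (n_way * k_shot, d) support-set embeddings, here assumed to
             come from a frozen video feature backbone.
    support_labels: (n_way * k_shot,) integer class labels in [0, n_way).
    queries: (n_query, d) query-set embeddings.
    Returns the predicted class index for each query.
    """
    # Class prototype = mean of that class's support embeddings.
    prototypes = np.stack([
        support[support_labels == c].mean(axis=0) for c in range(n_way)
    ])
    # Squared Euclidean distance from every query to every prototype;
    # the negative distance plays the role of the classification logit.
    dists = ((queries[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return (-dists).argmax(axis=1)

# Toy 3-way 2-shot episode with 8-dim "video features" drawn around
# well-separated class centers, so nearest-prototype should be exact.
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 8)) * 5.0
support_labels = np.repeat(np.arange(3), 2)
support = centers[support_labels] + rng.normal(scale=0.1, size=(6, 8))
query_labels = np.repeat(np.arange(3), 4)
queries = centers[query_labels] + rng.normal(scale=0.1, size=(12, 8))

preds = prototypical_classify(support, support_labels, queries, n_way=3)
print((preds == query_labels).mean())
```

In an actual episodic training loop, the backbone would be fine-tuned (or kept frozen, as in transfer settings) while episodes like the one above are sampled from the base classes; at meta-test time the same nearest-prototype rule is applied to novel classes with only a few labeled clips each.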

Keywords:
Computer science; Transformer; Action recognition; Artificial intelligence; Computer vision; Pattern recognition

Metrics

Cited By: 0
FWCI (Field-Weighted Citation Impact): 0.00
References: 39
Citation Normalized Percentile: 0.22

Topics

Human Pose and Action Recognition (Computer Science → Computer Vision and Pattern Recognition)
Anomaly Detection Techniques and Applications (Computer Science → Artificial Intelligence)
Video Surveillance and Tracking Methods (Computer Science → Computer Vision and Pattern Recognition)