JOURNAL ARTICLE

PGVT: Pose-Guided Video Transformer for Fine-Grained Action Recognition

Abstract

Building on recent advances in transformer-based video models and multi-modal joint learning, we propose a novel model, the Pose-Guided Video Transformer (PGVT), which combines sparse, high-level body joint locations with dense, low-level visual pixels for effective learning and accurate recognition of human actions. PGVT leverages pre-trained image models by freezing their parameters and introducing trainable adapters that integrate the two input modalities, i.e., human poses and video frames, to learn a pose-focused spatiotemporal representation of human actions. We design two novel core modules, Pose Temporal Attention and Pose-Video Spatial Attention, to facilitate interaction between body joint locations and uniform video tokens, enriching each modality with contextualized information from the other. We evaluate PGVT on four action recognition datasets: Diving48, Gym99, and Gym288 for fine-grained action recognition, and Kinetics400 for coarse-grained action recognition. Our model achieves new state-of-the-art (SOTA) performance on the three fine-grained datasets and comparable performance on Kinetics400, with a small number of tunable parameters relative to SOTA methods. Ablation studies verify the benefits of the new designs.
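As an illustration of how pose tokens and video tokens can enrich each other, the following is a minimal sketch of generic single-head scaled dot-product cross-attention applied in both directions. It is not the authors' Pose-Video Spatial Attention module, whose details the abstract does not give. The token counts, embedding size, random projection matrices, and the names `cross_attention` and `random_projections` are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Generic single-head cross-attention with a residual connection.

    Each query token attends over all context tokens. This is an
    illustrative stand-in, not the module described in the paper.
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return queries + weights @ v, weights


rng = np.random.default_rng(0)
dim = 16                            # assumed embedding size
num_joints, num_patches = 17, 196   # assumed token counts for one frame

pose_tokens = rng.standard_normal((num_joints, dim))
video_tokens = rng.standard_normal((num_patches, dim))


def random_projections():
    """Hypothetical stand-ins for learned query/key/value projections."""
    return [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3)]


# Pose tokens gather visual context from the video patch tokens.
pose_out, pose_attn = cross_attention(pose_tokens, video_tokens, *random_projections())
# Video patch tokens gather pose context from the joint tokens.
video_out, video_attn = cross_attention(video_tokens, pose_tokens, *random_projections())
```

In this sketch each modality keeps its own token count: the pose output has one row per joint and the video output has one row per patch, and each row of an attention matrix sums to one.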

Keywords:
Computer science, Transformer, Action recognition, Artificial intelligence, Computer vision, Engineering, Voltage, Electrical engineering

Metrics

Cited By: 8
FWCI (Field Weighted Citation Impact): 4.24
Refs: 114
Citation Normalized Percentile: 0.89

Topics

Human Pose and Action Recognition
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Anomaly Detection Techniques and Applications
Physical Sciences → Computer Science → Artificial Intelligence
Gait Recognition and Analysis
Physical Sciences → Engineering → Biomedical Engineering
