PGVT: Pose-Guided Video Transformer for Fine-Grained Action Recognition

Haosong Zhang; Mei Chee Leong; Liyuan Li; Weisi Lin

doi:10.1109/wacv57701.2024.00651

JOURNAL ARTICLE

Year: 2024 Pages: 6631-6642

Abstract

Building on recent advances in transformer-based video models and multi-modal joint learning, we propose a novel model, the Pose-Guided Video Transformer (PGVT), which combines sparse, high-level body-joint locations with dense, low-level visual pixels for effective learning and accurate recognition of human actions. PGVT leverages pre-trained image models by freezing their parameters and introducing trainable adapters that integrate the two input modalities, i.e., human poses and video frames, to learn a pose-focused spatiotemporal representation of human actions. We design two novel core modules, Pose Temporal Attention and Pose-Video Spatial Attention, to facilitate interaction between body-joint locations and uniform video tokens, enriching each modality with contextualized information from the other. We evaluate PGVT on four action recognition datasets: Diving48, Gym99, and Gym288 for fine-grained action recognition, and Kinetics400 for coarse-grained action recognition. Our model achieves new SOTA performance on the three fine-grained datasets and comparable performance on Kinetics400, while using a small number of tunable parameters relative to SOTA methods. Ablation studies verify the benefits of our new designs.
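The abstract describes pose tokens and video tokens exchanging information through attention. The following is a minimal sketch of that general idea using plain scaled dot-product cross-attention in NumPy. It is not the paper's Pose-Video Spatial Attention module: the function names, the absence of learned projections, the residual update, and the token counts (17 joints, 196 patches, 64 dimensions) are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Single-head scaled dot-product attention without learned projections.

    queries: (Nq, D) tokens that gather information.
    context: (Nc, D) tokens that are attended to.
    Returns the attended features (Nq, D) and the attention weights (Nq, Nc).
    """
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d), axis=-1)
    return weights @ context, weights


def pose_video_exchange(pose_tokens, video_tokens):
    # Hypothetical helper: each modality attends to the other and adds the
    # result residually, so both token sets keep their original shapes.
    pose_ctx, _ = cross_attention(pose_tokens, video_tokens)
    video_ctx, _ = cross_attention(video_tokens, pose_tokens)
    return pose_tokens + pose_ctx, video_tokens + video_ctx


rng = np.random.default_rng(0)
pose = rng.standard_normal((17, 64))    # assumed: 17 body joints in one frame
video = rng.standard_normal((196, 64))  # assumed: 14 x 14 grid of patch tokens
pose_out, video_out = pose_video_exchange(pose, video)
print(pose_out.shape, video_out.shape)
```

In a trainable model, the queries, keys, and values would normally pass through learned linear projections, and the abstract indicates that only lightweight adapters are trained while the pre-trained image backbone stays frozen; neither detail is modeled in this sketch.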

Keywords:

Computer science; Transformer; Action recognition; Artificial intelligence; Computer vision; Engineering; Voltage; Electrical engineering

Metrics

FWCI (Field Weighted Citation Impact): 4.24

Refs: 114

Citation Normalized Percentile: 0.89

Topics

Human Pose and Action Recognition

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Anomaly Detection Techniques and Applications

Physical Sciences → Computer Science → Artificial Intelligence

Gait Recognition and Analysis

Physical Sciences → Engineering → Biomedical Engineering

Related Documents

Pose-Guided Transformer for Fine-Grained Action Quality Assessment

Video Pose Distillation for Few-Shot, Fine-Grained Sports Action Recognition

Pose-Guided Fine-Grained Sign Language Video Generation

Convolutional transformer network for fine-grained action recognition
