JOURNAL ARTICLE

3D human pose estimation in video with temporal and spatial transformer

Abstract

Previous work on 3D human pose estimation has concentrated on predicting the 3D pose of the human body from a single image, ignoring the correlation between adjacent frames in a video. We design a transformer network, TSFormer, that extracts temporal information from video. It improves the accuracy of human pose prediction by encoding relative position in a temporal fusion transformer structure, which strengthens local feature learning. We analyze our method quantitatively and qualitatively on Human3.6M. The results suggest that TSFormer achieves state-of-the-art performance.
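The abstract does not specify how relative position is encoded, so the following is only a generic sketch of the idea, not TSFormer itself: single-head self-attention across video frames, with a bias added to each attention score according to the clipped frame offset. The names `temporal_attention`, `rel_bias`, and `max_dist` are hypothetical, the bias values are hand-picked rather than learned, and identity query/key/value projections are assumed for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frames, rel_bias, max_dist):
    """Single-head self-attention over frames with a relative-position bias.

    frames:   list of T feature vectors (one per video frame), each of length d
    rel_bias: list of 2 * max_dist + 1 scalars, indexed by the clipped offset j - i
    Queries, keys, and values are the raw features (assumption: a real model
    would apply learned projections here).
    """
    d = len(frames[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for i, q in enumerate(frames):
        scores = []
        for j, k in enumerate(frames):
            # Clip the frame offset so distant frames share one bias entry.
            offset = max(-max_dist, min(max_dist, j - i))
            dot = sum(a * b for a, b in zip(q, k))
            scores.append(dot * scale + rel_bias[offset + max_dist])
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of all frame features for output frame i.
        outputs.append(
            [sum(w[j] * frames[j][c] for j in range(len(frames))) for c in range(d)]
        )
    return outputs, weights


# Toy input: 5 frames with 2-D features; the bias favours nearby frames.
frames = [[0.1 * t, 1.0] for t in range(5)]
rel_bias = [-2.0, -1.0, 0.0, -1.0, -2.0]  # offsets -2 .. 2
fused, attn = temporal_attention(frames, rel_bias, max_dist=2)
```

Because the bias depends only on the offset between frames, the same preference for nearby frames applies at every position in the sequence, which is one common way to emphasize local temporal context.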

Keywords:
Pose, Artificial intelligence, Computer science, Transformer, Computer vision, Pattern recognition (psychology), 3D pose estimation, Feature extraction, Encoding (memory), Engineering, Voltage

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 0
Citation Normalized Percentile: 0.05

Topics

Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Video Surveillance and Tracking Methods
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Gait Recognition and Analysis
Physical Sciences →  Engineering →  Biomedical Engineering