Previous work on 3D human pose estimation has concentrated on predicting the 3D pose of the human body from a single image, ignoring the correlation between adjacent frames in video. We design a transformer network that extracts temporal information from video and improves the accuracy of human pose prediction by encoding relative positions within a temporal fusion transformer structure, which strengthens local feature learning. We analyze our method quantitatively and qualitatively on Human3.6M. The results show that our TSFormer achieves state-of-the-art performance.
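As a rough sketch of the core idea, and not the paper's actual architecture, self-attention over a window of frames can incorporate a learnable relative-position bias so that attention weights depend on the temporal offset between frames. All names and shapes below are illustrative assumptions:

```python
import numpy as np

def temporal_attention_with_relative_bias(x, w_q, w_k, w_v, rel_bias):
    """Single-head self-attention over T frames with a relative-position bias.

    x:        (T, d) per-frame pose features
    w_q/k/v:  (d, d) projection matrices
    rel_bias: (2T-1,) one scalar per relative offset in [-(T-1), T-1]

    Hypothetical sketch, not the method from the paper.
    """
    T, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(d)                    # (T, T) frame-to-frame scores
    idx = np.arange(T)
    offsets = idx[:, None] - idx[None, :] + (T - 1)  # map offsets to [0, 2T-2]
    scores = scores + rel_bias[offsets]              # add relative-position bias
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                               # temporally fused features
```

Because the bias is indexed by frame offset rather than absolute frame index, nearby frames can be consistently emphasized regardless of where the window sits in the video.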
Yongpeng Wu, Dehui Kong, Shaofan Wang, Jinghua Li, Baocai Yin
Haijian Wang, Qingxuan Shi, Beiguang Shan