Lili Zhang, Shuyao Dai, Lihuang She, Shuwei Huo
Three-dimensional human pose estimation (3D HPE) converts an input image or video into the 3D coordinates of human body keypoints. The mainstream approach at present is to use a 2D pose estimate as an intermediate representation and then regress it to a 3D pose. The general difficulty of this scheme is how to effectively extract features relating the 2D joint points and regress them to 3D coordinates in a highly nonlinear 3D space. In this paper, we propose a new algorithm, called TSHDC, that addresses this difficulty by considering the temporal and spatial characteristics of human joint points. By introducing the self-attention mechanism and the temporal convolutional network (TCN) into the 3D HPE task, the model needs a temporal receptive field of only 27 frames, which gives it fewer parameters and faster convergence while keeping its accuracy close to that of the SOTA-level algorithm (+6.8 mm). The TSHDC model is deployed on the Jetson TX2 embedded platform; with TensorRT, model inference is 13.7 times faster at the cost of only a small loss of accuracy (5%). Comprehensive experimental results on representative benchmarks show that our method outperforms the state-of-the-art methods in quantitative and qualitative evaluation.
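As a rough illustration of how a temporal convolutional stack can reach a 27-frame receptive field, the sketch below computes the receptive field of stacked dilated 1D convolutions. The abstract does not give the layer configuration of TSHDC, so the kernel sizes and dilations used here (three layers with kernel size 3 and dilations 1, 3, 9) are assumptions chosen only because they yield 27 frames; the function name `receptive_field` is likewise hypothetical.

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field (in frames) of stacked stride-1 dilated 1D convolutions.

    Each layer with kernel size k and dilation d widens the field by (k - 1) * d.
    """
    rf = 1  # a single output frame sees at least itself
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf


# Assumed configuration (not taken from the paper): 3 layers, kernel 3,
# dilations growing by a factor of 3 per layer.
kernels = [3, 3, 3]
dilations = [1, 3, 9]
print(receptive_field(kernels, dilations))  # 1 + 2 + 6 + 18 = 27
```

Under this assumed configuration, each predicted 3D pose would depend on 27 consecutive frames of 2D keypoints; other kernel and dilation choices can give the same total.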
Zhanhong Yin, Renchao Qin, Chengzhuo Ye, Ya Li, Yaying He, Yue Shu, Ruilin Jiang
Tingjian Yu, Zemin Yuan, Tao Huang, Xiang Fu
Wenhan Zhu, Cheng Zhang, Juexuan Li, Zeya Wang