JOURNAL ARTICLE

Swin-Pose: Swin Transformer Based Human Pose Estimation

Abstract

Convolutional neural networks (CNNs) have been widely utilized in many computer vision tasks. However, CNNs have a fixed reception field and lack the ability of long-range perception, which is crucial to human pose estimation. Transformer architecture has been adopted to computer vision applications recently and is proven to be a highly effective architecture. We are interested in exploring its capability in human pose estimation, and thus propose a novel model based on transformer, enhanced with a feature pyramid fusion structure. More specifically, we use pre-trained Swin Transformer to extract features, and leverage a feature pyramid structure to extract and fuse feature maps from different stages. The experiment results of our study have demonstrated that the proposed transformer-based model can achieve better performance compared to the state-of-the-art CNN-based models.

Keywords:
Artificial intelligence Computer science Transformer Pose Convolutional neural network Leverage (statistics) Computer vision Pattern recognition (psychology) Feature extraction 3D pose estimation Architecture Engineering Voltage

Metrics

29
Cited By
3.59
FWCI (Field Weighted Citation Impact)
28
Refs
0.92
Citation Normalized Percentile
Is in top 1%
Is in top 10%

Citation History

Topics

Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Video Surveillance and Tracking Methods
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Neural Network Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
© 2026 ScienceGate Book Chapters — All rights reserved.