JOURNAL ARTICLE

Multi‐stage part‐aware graph convolutional network for skeleton‐based action recognition

IET Image Processing, Volume 16, Issue 8, pp. 2063-2074
Original Research, Open Access
First published: 09 March 2022
https://doi.org/10.1049/ipr2.12469

Authors:
Xiaofei Qin, School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, China
Hao Li, School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, China
Yuru Liu, College of Science, University of Shanghai for Science and Technology, Shanghai, China
Jiabin Yu (corresponding author, orcid.org/0000-0001-7875-8098), Key Laboratory of Industrial Internet and Big Data, China National Light Industry, Beijing, China; School of Artificial Intelligence, Beijing Technology and Business University, Beijing, China
Changxiang He, College of Science, University of Shanghai for Science and Technology, Shanghai, China
Xuedian Zhang, School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, China; Shanghai Institute of Intelligent Science and Technology, Tongji University, Shanghai, China; Shanghai Key Laboratory of Contemporary Optics System, Shanghai, China; Key Laboratory of Biomedical Optical Technology and Devices of Ministry of Education, Shanghai, China

Correspondence: Jiabin Yu, Key Laboratory of Industrial Internet and Big Data, China National Light Industry, Beijing 100048, China. Email: [email protected]

Abstract

Recently, graph convolutional networks have shown excellent results in skeleton-based action recognition. This paper presents a multi-stage part-aware graph convolutional network to address the problems of model overcomplication, parameter redundancy, and the lack of long-range dependency information. The network has a multi-stream input and a two-stream output, which can greatly reduce the complexity and improve the accuracy of the model without losing sequence information. The two branches of the network share the same backbone design, which includes 6 multi-order feature extraction blocks and 3 temporal attention calibration blocks, and the outputs of the two branches are fused together.
In the multi-order feature extraction block, a channel-spatial attention mechanism and a graph condensation module are proposed, which extract more distinguishable features and identify the relationships between body parts. In the temporal attention calibration block, the temporal dependencies between frames in the skeleton sequence are modeled. Experimental results show that the proposed network outperforms many mainstream methods on the NTU and Kinetics datasets; for example, it achieves 92.4% accuracy on the cross-subject benchmark of the NTU-RGB+D60 dataset.

1 INTRODUCTION

Action recognition can broadly be divided into video-based and skeleton-based approaches. Compared to video data, skeleton data is much more robust in dynamic environments with complicated backgrounds. In addition, skeleton sequences contain most of the action information and model the structure of the human body well, for example through joint positions and bone links. Video sequences, by contrast, often introduce noise such as color, texture, and body size. Meanwhile, advances in low-cost depth cameras [5] and pose estimation technology [6] have made skeleton data much easier to access than before. Therefore, skeleton-based action recognition has been widely studied in recent years and has gained momentum in fields such as human–computer interaction, health care, and robotics [1-4].

In the past few years, many methods based on convolutional neural networks (CNNs) [7-9] and recurrent neural networks (RNNs) [10-12] have been proposed. They overcome the limitations of handcrafted-feature methods [13-15] and effectively model the spatiotemporal information in the action sequence. However, the skeleton of a human body is naturally a graph in non-Euclidean space, and the dependencies between associated joints cannot be fully expressed if the skeleton is simply viewed as a vector sequence or a 2D grid.
Recently, some methods based on graph convolutional networks (GCNs) have been proposed for this task [16-22]. Compared with CNNs and RNNs, GCNs can represent graph data more effectively. Among these GCN-based methods, two-stream adaptive graph convolutional networks (2s-AGCN) [19] and semantics-guided neural networks (SGN) [22] are typical and popular ones. However, typical GCN-based methods such as SGN and 2s-AGCN have three drawbacks:

(1) In real life, an action often requires coordination among multiple distant joints. However, typical GCNs ignore the dependencies among joints that are not physically connected, because one GCN layer only aggregates first-order neighbor information. To model high-order neighbor information, GCN layers have to be stacked many times, which causes unnecessary redundancy and weakens the network's ability to aggregate global semantic information. For example, when people clap their hands, the movement of the two hands is the most important cue, but the features of the two hands have to be passed through both arms and the chest before they can interact.

(2) The skeleton graph is the same in every layer of a typical GCN, which prevents the network from efficiently modeling the semantic information contained in different layers and limits its flexibility.

(3) Typical GCNs do not fully exploit the potential of the attention mechanism, and the importance of different feature maps is not explicitly modeled, which can lead to a loss of distinguishable information.

To address the above issues, a novel multi-stage part-aware graph convolutional network (MS-PGCN) is proposed; its structure is shown in Figure 1. The backbone of MS-PGCN consists of multi-order feature extraction blocks (MFEBs) and temporal attention calibration blocks (TACBs). From the spatial perspective, the MFEB contains a channel-spatial attention mechanism (CSAM) and a graph condensation module (GCM).
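The first drawback above, that a single GCN layer aggregates only first-order neighbors, can be illustrated with a small sketch. It is not the authors' implementation: the 7-joint chain, the joint names, and the helper functions are hypothetical, the learnable weights and nonlinearity are omitted, and the symmetric normalization with self-loops is the commonly used GCN form rather than one confirmed by the paper.

```python
import numpy as np

# Hypothetical 7-joint chain (not the NTU joint layout):
# left hand - left elbow - left shoulder - chest - right shoulder - right elbow - right hand
JOINTS = ["l_hand", "l_elbow", "l_shoulder", "chest",
          "r_shoulder", "r_elbow", "r_hand"]
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]


def normalized_adjacency(num_joints, edges):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected skeleton graph."""
    a = np.eye(num_joints)  # self-loops
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def layers_until_reached(a_hat, source, target):
    """Count aggregation steps until a signal at `source` reaches `target`."""
    x = np.zeros(a_hat.shape[0])
    x[source] = 1.0
    for layer in range(1, a_hat.shape[0] + 1):
        x = a_hat @ x  # one first-order aggregation step (weights omitted)
        if x[target] > 0:
            return layer
    return None


a_hat = normalized_adjacency(len(JOINTS), EDGES)

# After one layer, a feature placed on the left hand has only reached
# the left hand itself and the left elbow.
one_layer = a_hat @ np.eye(len(JOINTS))[0]
print([JOINTS[i] for i in np.nonzero(one_layer)[0]])

# Number of stacked layers before the two hands exchange information.
print(layers_until_reached(a_hat, 0, 6))
```

In this toy chain the two hands are six edges apart, so six stacked aggregation steps are needed before they influence each other, which is the kind of redundancy the text describes.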
The CSAM boosts the ability of the MFEB to capture spatiotemporal features, and the GCM is designed to condense related joints. From the temporal perspective, the role of the TACB is to integrate and calibrate inter-frame information. Furthermore, three kinds of information from the skeleton sequence are used as the input of the network: bone vectors, joint positions, and motion velocities. One branch takes the joint data and the other takes the bone data, each concatenated with the motion data. The softmax outputs of the two branches are fused to generate the final recognition result.

FIGURE 1. Structure of MS-PGCN. In the feature extraction part, one green block represents an MFEB, and one yellow block represents a TACB. The prediction scores of the two branches are combined to obtain the final prediction score.

Comprehensive experiments on three large public datasets are conducted to demonstrate the effectiveness of the proposed MS-PGCN: NTU-RGB+D60 [23], NTU-RGB+D120 [24], and Kinetics [25]. According to the experimental results, MS-PGCN performs better than many state-of-the-art approaches.

The following are the primary contributions of this work: Two novel blocks are proposed, that is, MFEB and TACB. Based on a kind of channel-spatial attention mechanism and a manually designed joints condensation scheme, the MFEB is designed. Based on the attention mechanism and convolutional operations in the te
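The three input representations and the two-branch score fusion described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the parent table, the zero-padding of the last frame, the choice to pair each stream with its own frame difference, and the equal-weight sum of softmax scores are all assumptions, and every function name is hypothetical.

```python
import numpy as np


def bone_vectors(joints, parents):
    """Bone = joint position minus its parent's position; the root gets zeros."""
    return joints - joints[:, parents, :]


def motion_velocities(data):
    """Frame-to-frame difference, zero-padded at the last frame (assumption)."""
    motion = np.zeros_like(data)
    motion[:-1] = data[1:] - data[:-1]
    return motion


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def fuse_scores(joint_logits, bone_logits):
    """Sum the softmax scores of the two branches (equal weights assumed)."""
    return softmax(joint_logits) + softmax(bone_logits)


# Toy sequence: T=4 frames, V=3 joints, C=3 coordinates.
rng = np.random.default_rng(0)
joints = rng.normal(size=(4, 3, 3))
parents = np.array([0, 0, 1])  # hypothetical tree; joint 0 is its own parent (root)

bones = bone_vectors(joints, parents)

# Each branch input concatenates one stream with motion data along the channel axis.
joint_branch_input = np.concatenate([joints, motion_velocities(joints)], axis=-1)
bone_branch_input = np.concatenate([bones, motion_velocities(bones)], axis=-1)
print(joint_branch_input.shape, bone_branch_input.shape)  # (4, 3, 6) (4, 3, 6)

# Made-up per-class logits standing in for the outputs of the two branches.
scores = fuse_scores(np.array([2.0, 0.5, 0.1]), np.array([0.2, 1.5, 0.3]))
predicted_class = int(scores.argmax())
print(predicted_class)
```

Fusing at the score level keeps the two branches independent during the forward pass, so either branch can also be evaluated on its own.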



Topics

Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Hand Gesture Recognition Systems
Physical Sciences →  Computer Science →  Human-Computer Interaction
Context-Aware Activity Recognition Systems
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition