Tianming Zhuang, Zhen Qin, Yi Ding, Fuhu Deng, Leduo Chen, Zhiguang Qin, Kim‐Kwang Raymond Choo
Human skeleton data, widely used in human activity recognition, is among the most representative biometric characteristics owing to its intuitiveness and visual interpretability. State-of-the-art approaches mainly focus on improving the modeling of spatial correlations within graph topologies. However, inter-frame motion representations are also of vital importance, and we argue that they deserve attention and exploration. Therefore, a temporal refinement module with a contrastive learning mechanism is proposed and fused as a complement to the conventional spatial graph convolution layer. In addition, to further exploit inter-frame variances and generalize the graph convolutional network (GCN) operation to the temporal dimension, a temporal-correlation matrix is introduced to effectively capture dynamic dependencies within frame pairs, enhancing semantic feature representation. Moreover, since GCN is a typical local operator that cannot fully model long-term relations along spatial and temporal variations, a spatial-temporal cascaded aggregation (STCA) module is designed to enlarge the receptive field. The overall recognition framework combines these three novelties and achieves remarkable performance on benchmark datasets (i.e., NTU RGB+D 60, NTU RGB+D 120, PKU-MMD, and Kinetics Skeleton 400). Extensive experiments demonstrate the effectiveness of the proposed framework, e.g., recognition accuracies of 90.9% and 96.8% on NTU RGB+D 60 and 87.9% and 88.9% on NTU RGB+D 120.
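The idea of generalizing a GCN-style aggregation to the temporal dimension via a temporal-correlation matrix can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the per-frame feature shapes, the dot-product affinity with softmax normalization, and the pooling of joints into a single per-frame vector are all assumptions made for clarity.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_gcn_layer(x, w):
    """GCN-style update along the temporal axis (illustrative sketch).

    x: (T, C) per-frame skeleton features (joints assumed already pooled)
    w: (C, C_out) learnable projection
    Returns (T, C_out) features aggregated over correlated frames.
    """
    # Temporal-correlation matrix: pairwise frame-pair affinities,
    # scaled and normalized so each frame's weights sum to one.
    a = softmax(x @ x.T / np.sqrt(x.shape[1]))  # (T, T)
    # Aggregate features across correlated frames, then project + ReLU.
    return np.maximum(a @ x @ w, 0.0)

# Toy example: 4 frames, 8-dim features, projected to 16 dims.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w = rng.standard_normal((8, 16)) * 0.1
y = temporal_gcn_layer(x, w)
print(y.shape)  # (4, 16)
```

Each row of the temporal-correlation matrix weights how strongly every other frame contributes to the current frame's updated representation, mirroring how a spatial GCN aggregates over adjacent joints.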