Haoyuan Zhang, Yonghong Hou, Wenjing Zhang
In this paper, we investigate unsupervised representation learning for skeleton-based action recognition and develop a simple yet effective framework, SKeletal Twins (SKT), which learns representations from unlabeled skeleton data. Specifically, we choose skeleton-specific spatial and temporal augmentations for learning spatio-temporal dynamics. The augmented skeleton sequence is then represented as a graph with both spatial and temporal edges, so that the GCN-based twin encoders can encode both human pose and the temporal motion of each joint. The Barlow Twins objective function is used to reduce redundancy while keeping the representations of different skeleton augmentations similar. However, this objective ignores the instance-level consistency of a skeleton instance across different augmentations; we therefore design an instance-level consistency-enhanced objective function and optimize it jointly, which boosts representation learning. Extensive experiments verify that the proposed framework achieves state-of-the-art results on the challenging NTU-60 and NTU-120 datasets.
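For readers unfamiliar with the Barlow Twins objective mentioned in the abstract, the following is a minimal pure-Python sketch of its standard form: the cross-correlation matrix between the embeddings of two augmented views is pushed toward the identity, so that diagonal entries enforce similarity and off-diagonal entries penalize redundancy. The function names, the list-of-rows input format, and the weight `lam=5e-3` are illustrative assumptions, not details taken from this abstract. The instance-level consistency term proposed by the authors is not shown, because the abstract does not specify its form.

```python
import math


def standardize(z):
    """Normalize each embedding dimension to zero mean and unit std over the batch."""
    n, d = len(z), len(z[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        # Fall back to 1.0 for a constant dimension to avoid division by zero.
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0
        for i in range(n):
            out[i][j] = (z[i][j] - mean) / std
    return out


def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Barlow Twins loss for two batches of embeddings (hypothetical helper).

    z_a, z_b: lists of rows, one row per sample, same shape for both views.
    lam: weight of the redundancy-reduction term (assumed value).
    """
    n, d = len(z_a), len(z_a[0])
    za, zb = standardize(z_a), standardize(z_b)
    # Cross-correlation matrix between the two views, averaged over the batch.
    c = [[sum(za[b][i] * zb[b][j] for b in range(n)) / n for j in range(d)]
         for i in range(d)]
    # Invariance term: diagonal entries should be close to 1.
    on_diag = sum((1.0 - c[i][i]) ** 2 for i in range(d))
    # Redundancy term: off-diagonal entries should be close to 0.
    off_diag = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return on_diag + lam * off_diag
```

In a framework such as the one described, `z_a` and `z_b` would presumably be the twin encoders' outputs for two augmentations of the same batch of skeleton sequences; that wiring is an assumption here and is not shown.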
Lilang Lin, Lehong Wu, Jiahang Zhang, Jiaying Liu
Jianfeng Dong, Shengkai Sun, Zhonglin Liu, Shujie Chen, Baolong Liu, Xun Wang
Wenjing Zhang, Yonghong Hou, Haoyuan Zhang