JOURNAL ARTICLE

Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding

Abstract

Unsupervised pre-training has recently shown great success in skeleton-based action understanding. Existing works typically train separate modality-specific models (i.e., joint, bone, and motion) and then integrate the multi-modal information for action understanding via a late-fusion strategy. Although these approaches achieve strong performance, they suffer from complex yet redundant multi-stream model designs, each limited to a fixed input skeleton modality. To alleviate these issues, in this paper we propose a Unified Multi-modal Unsupervised Representation Learning framework, called UmURL, which exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner. Specifically, instead of designing separate modality-specific optimization processes for uni-modal unsupervised learning, we feed different modality inputs into the same stream with an early-fusion strategy to learn their multi-modal features, reducing model complexity. To ensure that the fused multi-modal features do not exhibit modality bias, i.e., are not dominated by a certain modality input, we further propose both intra- and inter-modal consistency learning to guarantee that the multi-modal features contain the complete semantics of each modality via feature decomposition and distinct alignment. In this manner, our framework is able to learn unified representations of uni-modal or multi-modal skeleton input, and is thus flexible to different kinds of modality input for robust action understanding in practical cases. Extensive experiments conducted on three large-scale datasets, i.e., NTU-60, NTU-120, and PKU-MMD II, demonstrate that UmURL is highly efficient, with complexity comparable to uni-modal methods, while achieving new state-of-the-art performance across various downstream task scenarios in skeleton-based action representation learning.
Our source code is available at https://github.com/HuiGuanLab/UmURL.
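To make the early-fusion idea in the abstract concrete, the following is a minimal sketch (not the authors' implementation; the function names, the toy skeleton topology, and the channel-wise concatenation choice are illustrative assumptions): bone and motion streams are derived from the joint sequence, then all three modalities are fused before a single encoder would see them, rather than training one encoder per modality.

```python
import numpy as np

def joints_to_modalities(joints, bone_pairs):
    """Derive bone and motion streams from a joint sequence.

    joints: array of shape (T, V, C) -- T frames, V joints, C coordinates.
    bone_pairs: list of (child, parent) joint indices (assumed topology).
    """
    bones = np.zeros_like(joints)
    for child, parent in bone_pairs:
        # bone vector: offset of each child joint from its parent
        bones[:, child] = joints[:, child] - joints[:, parent]
    motion = np.zeros_like(joints)
    motion[1:] = joints[1:] - joints[:-1]  # frame-to-frame differences
    return bones, motion

def early_fuse(joints, bones, motion):
    """Early fusion: concatenate modalities along the channel axis so a
    single-stream encoder receives all three at once, instead of running
    three separate modality-specific encoders and fusing late."""
    return np.concatenate([joints, bones, motion], axis=-1)  # (T, V, 3C)

# toy example: 4 frames, 3 joints, 3-D coordinates
T, V, C = 4, 3, 3
joints = np.random.randn(T, V, C)
bone_pairs = [(1, 0), (2, 1)]  # hypothetical 3-joint chain
bones, motion = joints_to_modalities(joints, bone_pairs)
fused = early_fuse(joints, bones, motion)
print(fused.shape)  # (4, 3, 9)
```

A single encoder applied to `fused` has roughly the cost of one uni-modal stream, which is the efficiency argument the abstract makes; the paper's intra- and inter-modal consistency losses then keep this fused representation from being dominated by any one modality.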

Keywords:
Computer science; Modality; Artificial intelligence; Feature learning; Representation learning; Semantics; Unsupervised learning; Machine learning; Consistency learning; Action recognition; Pattern recognition

Metrics

Cited By: 18
FWCI (Field-Weighted Citation Impact): 3.28
References: 61
Citation Normalized Percentile: 0.91


Topics

Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Anomaly Detection Techniques and Applications
Physical Sciences →  Computer Science →  Artificial Intelligence

Related Documents

CONFERENCE PAPER

Skeletal Twins: Unsupervised Skeleton-Based Action Representation Learning

Haoyuan Zhang, Yonghong Hou, Wenjing Zhang

Venue: 2022 IEEE International Conference on Multimedia and Expo (ICME), 2022, pp. 1-6
CONFERENCE PAPER

Hierarchical Contrast for Unsupervised Skeleton-Based Action Representation Learning

Jianfeng Dong, Shengkai Sun, Zhonglin Liu, Shujie Chen, Baolong Liu, Xun Wang

Venue: Proceedings of the AAAI Conference on Artificial Intelligence, 2023, Vol. 37 (1), pp. 525-533
JOURNAL ARTICLE

Unsupervised skeleton-based action representation learning via relation consistency pursuit

Wenjing Zhang, Yonghong Hou, Haoyuan Zhang

Journal: Neural Computing and Applications, 2022, Vol. 34 (22), pp. 20327-20339