Skeletal Spatial-Temporal Semantics Guided Homogeneous-Heterogeneous Multimodal Network for Action Recognition

Chenwei Zhang; Yuxuan Hu; Min Yang; Chengming Li; Xiping Hu

doi:10.1145/3581783.3612560


JOURNAL ARTICLE


Year: 2023; Pages: 3657-3666



Abstract

Action recognition research has gained significant attention and is dominated by two unimodal approaches: skeleton-based and RGB video-based. The former is robust to complex backgrounds, while the latter provides rich environmental information useful for context-based analysis. However, fusing the two modalities remains an open challenge. In this paper, we propose a Spatial Transformer & Selective Temporal encoder (ST&ST) for skeleton-based action recognition, built from two modules: a Reranking-Enhanced Dynamic Mask Transformer (RE-DMT) and a Selective Kernel Temporal Convolution (SK-TC). RE-DMT captures global spatial features, and its dynamic mask and reranking strategies reduce redundancy. SK-TC captures both long-term and short-term temporal features and fuses them adaptively. We further propose a Homogeneous-Heterogeneous Multimodal Network (HHMNet) that performs multi-modal action recognition in two phases. In the first phase, contrastive learning achieves implicit semantic fusion within the four homogeneous skeletal modalities (joint, bone, etc.). In the second phase, the heterogeneous modalities (skeleton and RGB video) are fused at three levels: model, feature, and decision. At the model level, the skeleton-based model from the first phase provides explicit attention guidance to the RGB video-based model. At the feature level, multi-part contrastive learning enables semantic distillation between the heterogeneous modalities. At the decision level, ensemble learning combines the outputs for final action recognition. We evaluate the ST&ST-guided HHMNet on the NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets and show that it achieves state-of-the-art performance in both skeleton-based and multi-modal action recognition.
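The decision-level step mentioned in the abstract can be illustrated with a minimal sketch of late score fusion: each stream's class scores are converted to probabilities and then averaged with per-stream weights. This is only one common form of ensemble learning, not the authors' implementation; the abstract does not specify how outputs are combined. The function names `softmax` and `fuse_decisions`, the stream names, the weights, and the logits below are all hypothetical.

```python
import math
from typing import Dict, List, Optional


def softmax(logits: List[float]) -> List[float]:
    """Convert raw class scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_decisions(
    stream_logits: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Weighted average of per-stream class probabilities.

    Equal weights are used when none are given. This is an assumption:
    the abstract does not say how the streams are weighted.
    """
    if weights is None:
        weights = {name: 1.0 for name in stream_logits}
    weight_sum = sum(weights[name] for name in stream_logits)
    num_classes = len(next(iter(stream_logits.values())))
    fused = [0.0] * num_classes
    for name, logits in stream_logits.items():
        probs = softmax(logits)
        for c in range(num_classes):
            fused[c] += weights[name] * probs[c] / weight_sum
    return fused


# Hypothetical logits for three action classes from two streams.
scores = {
    "skeleton": [2.0, 0.5, 0.1],
    "rgb": [0.2, 1.5, 0.3],
}
fused = fuse_decisions(scores, weights={"skeleton": 0.6, "rgb": 0.4})
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Averaging probabilities rather than raw logits keeps streams with differently scaled outputs comparable; the weights would normally be tuned on a validation set.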

Keywords:

Computer science; Artificial intelligence; Action recognition; RGB color model; Pattern recognition (psychology); Computer vision

Metrics

FWCI (Field Weighted Citation Impact): 0.73

Citation Normalized Percentile: 0.67

Topics

Human Pose and Action Recognition

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Hand Gesture Recognition Systems

Physical Sciences → Computer Science → Human-Computer Interaction

Gait Recognition and Analysis

Physical Sciences → Engineering → Biomedical Engineering


Related Documents

Energy-Guided Temporal Segmentation Network for Multimodal Human Action Recognition

A Video Action Recognition Model Guided by Temporal Action Semantics

Research on Human Upper Limb Action Recognition Method Based on Multimodal Heterogeneous Spatial Temporal Graph Network

STAR: Spatial Temporal Network for Action Recognition

Action Recognition Network Based on Temporal Spatial Temporal Mode