Xi ZhangChek Tien TanYuan YuanYan Jiang
Graph Convolutional Networks (GCNs) have emerged as a leading approach for human skeleton-based action recognition, owing to their capacity to represent skeletal joints as adaptive graphs that effectively capture complex spatial relationships for feature aggregation. However, existing methods predominantly emphasize either spatial context within individual frames or holistic temporal sequences, often overlooking the interplay of spatial topology across multiple temporal scales. This limitation hinders the model's ability to fully understand complex actions, especially those involving interactions that vary across different temporal phases. To address this challenge, we propose a Hierarchical Intertwined Graph Learning Framework (HI-GCN), which comprises two key modules: Intertwined Context Graph Convolution and Shifted Window Temporal Transformer. The former module integrates spatial-temporal information from adjacent frames at various temporal scales, thereby refining spatial relationship representations and capturing subtle topological variations that conventional GCNs tend to miss. The latter module advances temporal dependency modeling by applying shifted temporal windows with multi-scale receptive fields. Experimental results demonstrate that HI-GCN surpasses current state-of-the-art methods on multiple skeleton-based action recognition benchmarks, achieving accuracies of 93.3% on NTU RGB+D 60 (cross-subject), 90.3% on NTU RGB+D 120 (cross-subject), and 97.0% on NW-UCLA.
Linjiang HuangYan HuangWanli OuyangLiang Wang