JOURNAL ARTICLE

Multi-modal feature fusion for action recognition in RGB-D sequences

Abstract

The Microsoft Kinect outputs a multi-modal signal that provides RGB video, depth sequences, and skeleton information simultaneously. Many action recognition techniques have focused on a single modality of this signal and built their classifiers on features extracted from one channel. For better recognition performance, it is desirable to fuse the multi-modal information into an integrated set of discriminative features. However, most current fusion methods merge the heterogeneous features in a holistic manner and ignore the complementary properties of these modalities at finer levels. In this paper, we propose a hierarchical bag-of-words feature fusion technique based on multi-view structured sparsity learning to fuse atomic features from RGB and skeleton data for the task of action recognition.
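For illustration only, below is a minimal sketch of the kind of structured-sparsity fusion the abstract describes, not the authors' implementation: per-modality bag-of-words histograms (RGB and skeleton) are concatenated, and a group-lasso penalty with one group per modality is minimized by proximal gradient descent, so entire modality blocks can be weighted up or suppressed. All dimensions, the synthetic data, and the regularization strength are assumptions made for the example.

# Sketch (assumed setup): group-sparse fusion of RGB and skeleton bag-of-words features.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-video bag-of-words histograms for two modalities (assumed sizes).
n_videos, d_rgb, d_skel = 200, 64, 32
X_rgb = rng.random((n_videos, d_rgb))
X_skel = rng.random((n_videos, d_skel))
X = np.hstack([X_rgb, X_skel])                          # fused feature matrix
groups = [np.arange(0, d_rgb), np.arange(d_rgb, d_rgb + d_skel)]

# Toy binary labels (+1 / -1) that depend mostly on the skeleton block.
w_true = np.zeros(d_rgb + d_skel)
w_true[d_rgb:] = rng.normal(size=d_skel)
y = np.sign(X @ w_true + 0.1 * rng.normal(size=n_videos))

def group_soft_threshold(v, t):
    # Block soft-thresholding: shrinks a whole group's weight vector toward zero.
    norm = np.linalg.norm(v)
    return np.zeros_like(v) if norm <= t else (1.0 - t / norm) * v

def group_lasso_fit(X, y, groups, lam=1.0, n_iter=500):
    # Proximal gradient descent on 0.5*||Xw - y||^2 + lam * sum_g ||w_g||_2.
    lr = 1.0 / np.linalg.norm(X, 2) ** 2                # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)                        # gradient of the least-squares term
        w = w - lr * grad
        for g in groups:                                # prox step, one group per modality block
            w[g] = group_soft_threshold(w[g], lr * lam)
    return w

w = group_lasso_fit(X, y, groups, lam=5.0)
for name, g in zip(["RGB", "skeleton"], groups):
    print(f"{name:9s} block weight norm: {np.linalg.norm(w[g]):.3f}")

On this synthetic data the skeleton block retains most of the weight mass; the group structure is what lets the penalty act on each modality as a unit rather than on individual codewords.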

Keywords:
Action recognition, Multi-modal feature fusion, RGB-D, Feature extraction, Discriminative model, Pattern recognition, Computer vision, Artificial intelligence

Metrics

Cited by: 58
FWCI (Field Weighted Citation Impact): 6.51
References: 21
Citation Normalized Percentile: 0.98 (top 1%)


Topics

Advanced Neural Network Applications (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Video Coding and Compression Technologies (Physical Sciences → Computer Science → Signal Processing)
Human Pose and Action Recognition (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)