JOURNAL ARTICLE

Human action recognition based on multi-mode spatial-temporal feature fusion

Abstract

Motion representation plays a vital role in human action recognition. In recent years, the application of deep learning to action recognition has become popular. However, extracting accurate motion features remains a great challenge. In this study, a novel feature representation that combines multi-scale spatial-temporal features is proposed. This descriptor contains spatial-temporal information for three modes, extracted from three input channels: RGB images, RGB difference images, and binary XOR images. Specifically, a network consisting of a convolutional neural network (CNN) and a long short-term memory (LSTM) network extracts spatial-temporal features from the RGB images and the RGB difference images, respectively. In addition, global motion information is extracted from the binary XOR images using a separate CNN. The features from the three channels are then combined into a new video feature representation. Finally, an extreme learning machine (ELM) is adopted as the classifier. Experimental results on the UCF-50 dataset show the superiority of the proposed method.
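The two motion channels described above can be illustrated with a minimal sketch. The following is an assumption-laden illustration, not the paper's implementation: it computes RGB difference images as absolute frame-to-frame differences, and binary XOR images by thresholding grayscale frames at an assumed fixed intensity and XOR-ing consecutive binary maps (the abstract does not specify the binarization scheme).

```python
import numpy as np

def rgb_difference(frames):
    """Absolute frame-to-frame RGB difference images.

    frames: uint8 array of shape (T, H, W, 3); returns (T-1, H, W, 3)."""
    frames = frames.astype(np.int16)  # avoid uint8 wrap-around on subtraction
    return np.abs(frames[1:] - frames[:-1]).astype(np.uint8)

def binary_xor(frames, threshold=128):
    """Binary XOR images: binarize each frame, XOR consecutive pairs.

    The grayscale conversion (channel mean) and fixed threshold are
    assumptions for illustration; returns a (T-1, H, W) array of 0/1."""
    gray = frames.mean(axis=-1)                 # (T, H, W)
    binary = (gray >= threshold).astype(np.uint8)
    return np.bitwise_xor(binary[1:], binary[:-1])

# Toy clip: 4 frames of 8x8 RGB, with a bright 3x3 patch in frame 1 only.
clip = np.zeros((4, 8, 8, 3), dtype=np.uint8)
clip[1, 2:5, 2:5] = 255

diff = rgb_difference(clip)   # shape (3, 8, 8, 3)
xor = binary_xor(clip)        # shape (3, 8, 8); nonzero where the patch moved
```

In the method described by the abstract, `diff` would feed the CNN+LSTM branch alongside the raw RGB frames, while `xor` would feed the separate CNN that captures global motion.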

Keywords:
Artificial intelligence, Computer science, RGB color model, Pattern recognition, Convolutional neural network, Feature extraction, Feature learning, Computer vision, Classifier, Local binary patterns, Extreme learning machine, Artificial neural network, Histogram

Metrics

Cited by: 5
References: 43
FWCI (Field-Weighted Citation Impact): 0.32
Citation Normalized Percentile: 0.63

Topics

Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Gait Recognition and Analysis
Physical Sciences →  Engineering →  Biomedical Engineering
Advanced Technologies in Various Fields
Physical Sciences →  Computer Science →  Artificial Intelligence