JOURNAL ARTICLE

Learning to Represent Spatio-Temporal Features for Fine Grained Action Recognition

Abstract

Convolutional neural networks have pushed the boundaries of action recognition in videos, especially with the introduction of 3D convolutions. However, how efficiently a 3D CNN can model temporal information remains an open question, which we investigate here, and we introduce a new optical flow representation to improve the motion stream. Starting from baseline inflated 3D CNNs, we separate the convolutional filters into spatial and temporal components, which reduces the number of parameters with minimal loss of accuracy. We evaluate our approach on the NTU RGB+D dataset, the largest human action dataset, and outperform the state of the art by a large margin.
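The abstract's separation of 3D filters into spatial and temporal components reduces parameters because a t×k×k kernel is replaced by a k×k spatial kernel followed by a t×1×1 temporal kernel. A minimal sketch of that parameter arithmetic, with hypothetical channel sizes chosen for illustration (the paper's actual layer configuration is not given here):

```python
def conv3d_params(c_in, c_out, t, k):
    """Weights in a standard 3D convolution with a t x k x k kernel."""
    return c_out * c_in * t * k * k

def factorized_params(c_in, c_out, t, k, c_mid):
    """Weights after separating the filter into a spatial (1 x k x k)
    convolution to c_mid channels and a temporal (t x 1 x 1) convolution."""
    spatial = c_mid * c_in * k * k   # 1 x k x k kernel
    temporal = c_out * c_mid * t     # t x 1 x 1 kernel
    return spatial + temporal

# Hypothetical layer: 64 -> 64 channels, 3 x 3 x 3 kernel,
# intermediate width equal to the input width.
full = conv3d_params(64, 64, 3, 3)          # 110592
separated = factorized_params(64, 64, 3, 3, 64)  # 49152
print(full, separated, round(full / separated, 2))
```

With these illustrative sizes the factorized form uses roughly 2.25× fewer weights; the actual savings depend on the intermediate channel width chosen at each layer.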

Keywords:
Computer science, Convolutional neural network, Action recognition, Optical flow, Artificial intelligence, Pattern recognition, Convolution, Motion, Feature learning, Deep learning, Machine learning, Artificial neural network

Metrics

Cited By: 1
FWCI (Field-Weighted Citation Impact): 0.14
Refs: 56
Citation Normalized Percentile: 0.50

Topics

Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Gait Recognition and Analysis
Physical Sciences →  Engineering →  Biomedical Engineering