Convolutional neural networks have pushed the boundaries of action recognition in videos, especially with the introduction of 3D convolutions. However, how efficiently a 3D CNN can model temporal information remains an open question, which we investigate here; we also introduce a new optical flow representation to improve the motion stream. Starting from baseline inflated 3D CNN networks, we separate the convolutional filters into spatial and temporal components, which reduces the number of parameters with minimal loss of accuracy. We evaluate our approach on the NTU RGB+D dataset, the largest human action dataset, and outperform the state of the art by a large margin.
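The parameter saving from separating a 3D filter into a spatial and a temporal part can be illustrated with a simple count. This is a minimal sketch: the channel sizes, kernel sizes, and the width of the intermediate layer are illustrative assumptions, not values taken from the paper.

```python
def conv3d_params(c_in, c_out, t, d):
    # full 3D convolution: c_out filters, each of shape c_in x t x d x d
    return c_out * c_in * t * d * d

def factored_params(c_in, c_out, t, d, mid):
    # spatial 1 x d x d convolution to `mid` channels,
    # followed by a temporal t x 1 x 1 convolution to c_out channels
    spatial = mid * c_in * d * d
    temporal = c_out * mid * t
    return spatial + temporal

# hypothetical layer: 64 -> 64 channels, 3x3x3 kernel, 64 intermediate channels
full = conv3d_params(64, 64, 3, 3)         # 110592 parameters
fact = factored_params(64, 64, 3, 3, 64)   # 49152 parameters
print(full, fact)
```

With these (assumed) sizes the factored layer uses less than half the parameters of the full 3D convolution; in practice the intermediate width can also be tuned to trade parameters against accuracy.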