Recent research on action recognition has focused largely on coarse-grained actions, and fine-grained action recognition has received comparatively little study. To address this gap, we propose a fine-grained action recognition method based on a deep convolutional network. The method uses the I3D network, which has been highly successful in coarse-grained action recognition, as its basic architecture. To capture the local features of fine-grained actions, the human pose and hand regions are also extracted. The I3D network is then used to extract features from the RGB video frames, optical flow, human pose, and hands, respectively. Finally, these features are combined. Because the I3D network receives multiple input streams, we call our method a Multi-stream I3D Network. We validate the method on the MPII Cooking 2 dataset and report the results in detail.
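The combination step can be illustrated with a minimal sketch. The abstract does not state how the four streams are combined, so the sketch assumes one common choice, late fusion by averaging per-stream softmax class scores. The stream names and the functions `softmax`, `fuse_streams`, and `predict` are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List

# Hypothetical stream names, one per input type listed in the abstract.
STREAMS = ("rgb", "flow", "pose", "hand")


def softmax(logits: List[float]) -> List[float]:
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(stream_logits: Dict[str, List[float]]) -> List[float]:
    """Assumed fusion rule: average the class probabilities of all streams."""
    probs = [softmax(stream_logits[name]) for name in STREAMS]
    num_classes = len(probs[0])
    if any(len(p) != num_classes for p in probs):
        raise ValueError("all streams must score the same number of classes")
    return [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]


def predict(stream_logits: Dict[str, List[float]]) -> int:
    """Return the index of the class with the highest fused probability."""
    fused = fuse_streams(stream_logits)
    return max(range(len(fused)), key=fused.__getitem__)


# Toy per-stream scores for three classes; in practice each list would be
# the output of an I3D network applied to the corresponding input stream.
example = {
    "rgb": [2.0, 0.0, 0.0],
    "flow": [0.0, 1.0, 0.0],
    "pose": [0.0, 0.0, 0.0],
    "hand": [1.0, 0.0, 0.0],
}
fused_scores = fuse_streams(example)
predicted_class = predict(example)
```

Averaging gives every stream equal weight. A weighted average or a learned fusion layer over concatenated features would be alternative choices under the same interface.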