Motion representation plays a vital role in human action recognition. In recent years, the application of deep learning to action recognition has become popular; however, extracting accurate motion features remains a great challenge. In this study, a novel feature representation that combines multi-scale spatial-temporal features is proposed. This descriptor contains spatial-temporal information for three modes, extracted from three input channels: RGB images, RGB difference images, and binary XOR images. Specifically, a network consisting of a convolutional neural network (CNN) and long short-term memory (LSTM) extracts spatial-temporal features from the RGB images and the RGB difference images, respectively. In addition, global motion information is extracted from the binary XOR images using a separate CNN. The features from the three channels are then combined into a new video feature representation. Finally, an extreme learning machine (ELM) is adopted as the classifier. Experimental results on the UCF-50 dataset show the superiority of the proposed method.
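The two motion channels can be derived directly from consecutive frames. The following is a minimal sketch of one plausible construction of the RGB difference and binary XOR images using NumPy; the binarization threshold and function names are illustrative assumptions, not details specified by the paper.

```python
import numpy as np

def rgb_difference(prev, curr):
    """RGB difference image: per-pixel absolute change between two
    consecutive uint8 frames of shape (H, W, 3), capturing
    short-term local motion."""
    return np.abs(curr.astype(np.int16) - prev.astype(np.int16)).astype(np.uint8)

def binary_xor(prev, curr, thresh=128):
    """Binary XOR image: binarize each frame's grayscale intensity,
    then XOR the two binary maps so that pixels whose state flipped
    between frames mark regions of global motion.
    The threshold of 128 is an illustrative assumption."""
    bin_prev = prev.mean(axis=2) > thresh
    bin_curr = curr.mean(axis=2) > thresh
    return np.logical_xor(bin_prev, bin_curr).astype(np.uint8)

# Example on two synthetic 8x8 RGB frames
prev = np.zeros((8, 8, 3), dtype=np.uint8)
curr = np.full((8, 8, 3), 200, dtype=np.uint8)
diff_img = rgb_difference(prev, curr)   # uniform difference of 200
xor_img = binary_xor(prev, curr)        # all pixels flipped -> all ones
```

In a full pipeline, each channel's image sequence would feed its own branch (CNN+LSTM for RGB and RGB difference, a separate CNN for binary XOR) before the fused features reach the ELM classifier.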