Action recognition in videos is a topic of interest in computer vision, owing to potential applications such as multimedia indexing and surveillance in public areas. In this research, we first propose spatial and temporal Convolutional Neural Networks (CNNs), based on transfer learning using ResNet101, GoogleNet, and VGG16, for human action recognition. In addition, hybrid networks such as CNN-Recurrent Neural Network (RNN) models are exploited as encoder-decoder architectures for video action classification. In particular, different types of RNNs, i.e. Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), Gated Recurrent Unit (GRU), and Bidirectional GRU (BiGRU), are employed as the decoders for action recognition. To further enhance performance, diverse aggregation networks of CNN and CNN-RNN models are implemented. Specifically, an Average Fusion method is used to integrate spatial and temporal CNNs trained on images, as well as CNN-RNNs trained on videos, where the final classification is obtained by combining the Softmax scores of these models via late fusion. A total of 22 models (1 motion CNN, 3 spatial CNNs, 12 CNN-RNNs, and 6 fusion networks) are implemented and evaluated on the UCF11, UCF50, and UCF101 datasets for performance comparison. The empirical results indicate the significant efficiency of the Average Fusion of multiple spatial CNNs with one motion CNN, as well as of ResNet101-BiGRU, among all the networks for realistic video action recognition.
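The Average Fusion strategy described above can be sketched as follows: each model produces class scores for a clip, the scores are passed through a Softmax, and the per-class probabilities are averaged across models before taking the arg-max. This is a minimal illustrative sketch, not the authors' implementation; the logit values and the three-class setup are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def average_fusion(logits_per_model):
    # Late fusion: average the Softmax score vectors of several models,
    # then predict the class with the highest fused score.
    fused = np.mean([softmax(l) for l in logits_per_model], axis=0)
    return int(fused.argmax(axis=-1)), fused

# Hypothetical logits for one clip over 3 action classes, from
# a spatial CNN, a motion CNN, and a CNN-RNN model.
spatial = np.array([2.0, 0.5, 0.1])
motion  = np.array([0.3, 1.8, 0.2])
cnn_rnn = np.array([1.5, 0.4, 0.6])

pred, fused = average_fusion([spatial, motion, cnn_rnn])
```

Averaging probabilities rather than raw logits keeps each model's contribution on a comparable scale, which is why late fusion typically operates on Softmax scores.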
Pavan Dasari, Li Zhang, Yonghong Yu, Haoqian Huang, Rong Gao