The demand for automatic action recognition systems has increased significantly due to the development of surveillance cameras with high sampling rates, low cost, small size and high resolution. Such systems can effectively support human operators in detecting events of interest in video sequences, reducing failures and improving recognition results. In this work, we develop and analyze a method to learn two-dimensional (2D) representations from videos through an autoencoder framework. A multi-stream network is used to incorporate spatial and temporal information for action recognition. Experiments conducted on the challenging UCF101 and HMDB51 data sets indicate that our representation achieves accuracy rates competitive with approaches from the literature.
Huigang Zhang, Liuan Wang, Jun Sun