Video action recognition has become an important research topic in computer vision. Existing deep-learning methods for action recognition, such as the C3D and 3D ResNet networks, lack attention mechanisms and are expensive to train on GPUs. This study proposes a new R-TST network structure that first uses an LSTM module to correlate the frames of a video, preserving as much of the action's feature information as possible. The TST module combines temporal attention and spatial attention to strengthen the expressive power of the features for action recognition. Experimental results show that the R-TST network improves GPU utilization and reduces hardware cost compared with other network structures, at the price of a slight decrease in accuracy on the UCF101 and HMDB51 datasets.
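The abstract does not give the TST module's equations, but the combination of temporal and spatial attention it describes can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name `tst_attention`, the pooled-descriptor scoring, and the weight vectors `w_t` and `w_s` are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tst_attention(feats, w_t, w_s):
    """Apply illustrative temporal and spatial attention to clip features.

    feats: (T, H, W, C) features for T frames; w_t, w_s: (C,) hypothetical
    learned score vectors standing in for the module's parameters.
    """
    T, H, W, C = feats.shape
    # Temporal attention: one weight per frame from its spatially pooled descriptor
    frame_desc = feats.mean(axis=(1, 2))           # (T, C)
    t_weights = softmax(frame_desc @ w_t, axis=0)  # (T,), sums to 1 over frames
    # Spatial attention: one weight per location within each frame
    s_scores = feats.reshape(T, H * W, C) @ w_s    # (T, H*W)
    s_weights = softmax(s_scores, axis=1).reshape(T, H, W, 1)
    # Reweight features spatially, then temporally
    out = feats * s_weights * t_weights[:, None, None, None]
    return out, t_weights

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 7, 7, 16))            # 8 frames, 7x7 grid, 16 channels
out, t_w = tst_attention(feats, rng.normal(size=16), rng.normal(size=16))
print(out.shape)  # (8, 7, 7, 16): same shape as the input, now attention-weighted
```

In a real network the score vectors would be learned jointly with the backbone, and the weighted features would feed the classifier; the point here is only that temporal and spatial attention each reduce to a softmax over a different axis of the same feature tensor.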