Yongkang Zhang, Han Zhang, Guoming Wu, Yangfan Xu, Zhiping Shi, Jun Li
2D convolutional neural networks, owing to their low computational complexity and fast recognition speed, have attracted growing attention in video action recognition. Methods based on temporal shift and temporal difference have made substantial progress, but the lack of a spatio-temporal attention mechanism causes a significant performance loss. To address this issue, we propose a Temporal-guided Multiattention Network (TMN), which fully excavates and fuses spatio-temporal attention information for effective video action recognition. Concretely, the multi-attention module squeezes and expands spatio-temporal features to weight the corresponding regions of a video along the spatial and temporal dimensions, while the adaptive temporal guidance module injects a temporal guiding signal into the spatial attention and re-weights the global temporal attention to achieve accurate temporal modeling. Extensive experiments and analyses show that TMN achieves state-of-the-art video action recognition performance on widely used benchmarks (HMDB51, UCF101 and Something-Something V1).
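The squeeze-and-reweight idea in the abstract can be illustrated with a toy example. The sketch below is not the TMN architecture: it has no learned layers and no temporal guidance module, and it assumes a single-channel clip stored as nested lists of shape T x H x W. The function names (`temporal_gates`, `spatial_gates`, `apply_attention`) are hypothetical and chosen only for illustration. It shows how a clip can be squeezed into one descriptor per frame and one per spatial location, and how the resulting gates are broadcast back to weight the features.

```python
import math


def sigmoid(v):
    """Map a real value to a gate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def temporal_gates(clip):
    """Squeeze each frame to its spatial mean, then gate it (one gate per frame)."""
    gates = []
    for frame in clip:
        total = sum(sum(row) for row in frame)
        count = len(frame) * len(frame[0])
        gates.append(sigmoid(total / count))
    return gates


def spatial_gates(clip):
    """Squeeze over time: one gate per (row, col) location."""
    t, h, w = len(clip), len(clip[0]), len(clip[0][0])
    return [
        [sigmoid(sum(clip[k][i][j] for k in range(t)) / t) for j in range(w)]
        for i in range(h)
    ]


def apply_attention(clip):
    """Broadcast both gates back over the clip and reweight every value."""
    tg = temporal_gates(clip)
    sg = spatial_gates(clip)
    t, h, w = len(clip), len(clip[0]), len(clip[0][0])
    return [
        [[clip[k][i][j] * tg[k] * sg[i][j] for j in range(w)] for i in range(h)]
        for k in range(t)
    ]


# Toy clip: two 2x2 frames, the second with stronger activations.
clip = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
weighted = apply_attention(clip)
```

In a trained model the gates would come from learned layers rather than a bare sigmoid of the mean; the sketch only makes the squeeze, gate, and broadcast steps concrete.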
Qiang Liu, Enqing Chen, Lei Gao, Chengwu Liang, Hao Liu
Bokai Zhang, Mohammad Hasan Sarhan, Bharti Goel, Svetlana Petculescu, Amer Ghanem
Peiyin Chen, Zhongke Gao, Miaomiao Yin, Jialing Wu, Kai Ma, Celso Grebogi
Jeong-Hun Kim, Fei Hao, Carson K. Leung, Aziz Nasridinov
Chenwei Zhang, Yuxuan Hu, Min Yang, Chengming Li, Xiping Hu