Mining more discriminative temporal features to enrich the temporal context representation is considered key to fine-grained action recognition. Previous action recognition methods use a fixed spatiotemporal window to learn local video representations. However, these methods fail to capture complex motion patterns because of their limited receptive field. To address this limitation, this paper proposes a lightweight Temporal Pyramid Excitation (TPE) module that captures short-, medium-, and long-term temporal context. Within TPE, the Temporal Pyramid (TP) module effectively expands the temporal receptive field of the network through multi-temporal kernel decomposition, without significantly increasing the computational cost. In addition, the Multi Excitation module emphasizes temporally important features to enhance temporal feature representation learning. TPE can be integrated into ResNet50 to build a compact video learning framework, TPENet. Extensive validation experiments on several challenging benchmark datasets (Something-Something V1, Something-Something V2, UCF-101, and HMDB51) demonstrate that our method achieves a favorable balance between computation and accuracy.
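To make the receptive-field argument concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes that "kernel decomposition" can be illustrated by stacking small 3-tap temporal filters, so that one, two, and three stacked filters cover 3, 5, and 7 frames, and that "excitation" can be illustrated by a sigmoid gate with a residual connection. The function names, the fixed averaging weights, and the summation used to fuse the branches are all hypothetical choices made for illustration.

```python
import math


def temporal_conv3(frames, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Apply a 3-tap temporal filter with zero padding.

    `frames` is a list of T per-frame feature vectors (lists of C floats).
    The fixed averaging weights are an assumption; a real module would learn them.
    """
    num_frames = len(frames)
    zero = [0.0] * len(frames[0])
    out = []
    for t in range(num_frames):
        prev_f = frames[t - 1] if t > 0 else zero
        next_f = frames[t + 1] if t < num_frames - 1 else zero
        out.append([
            weights[0] * p + weights[1] * c + weights[2] * n
            for p, c, n in zip(prev_f, frames[t], next_f)
        ])
    return out


def temporal_pyramid(frames):
    """Return short-, medium-, and long-term branches (hypothetical layout).

    Stacking 3-tap filters grows the temporal receptive field to 3, 5, and 7
    frames while each filter stays small.
    """
    short = temporal_conv3(frames)
    medium = temporal_conv3(short)
    long_term = temporal_conv3(medium)
    return short, medium, long_term


def excite(frames):
    """Re-weight each frame by a sigmoid gate and add a residual connection.

    The gate here is computed from the frame's channel mean (an assumption).
    """
    out = []
    for feat in frames:
        gate = 1.0 / (1.0 + math.exp(-sum(feat) / len(feat)))
        out.append([x + x * gate for x in feat])
    return out


def tpe_sketch(frames):
    """Fuse the three pyramid branches by summation, then apply the gate."""
    branches = temporal_pyramid(frames)
    fused = [
        [sum(vals) for vals in zip(*per_frame)]
        for per_frame in zip(*branches)
    ]
    return excite(fused)


# An impulse at the middle frame shows how far each branch "sees".
impulse = [[1.0] if t == 4 else [0.0] for t in range(9)]
short, medium, long_term = temporal_pyramid(impulse)
spans = [sum(1 for f in branch if f[0] != 0.0)
         for branch in (short, medium, long_term)]
output = tpe_sketch(impulse)
```

In this sketch `spans` records how many frames respond to a single-frame impulse in each branch, which is one way to see the short, medium, and long temporal contexts side by side. A practical version would replace the fixed weights with learned convolutions in a deep-learning framework.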
Yi‐Hung Liao, Yu Dai, Bohong Liu, Ying Xia
Ceyuan Yang, Yinghao Xu, Jianping Shi, Bo Dai, Bolei Zhou
Zhenxing Zheng, Gaoyun An, Qiuqi Ruan
Zhihao Liu, Yi Zhang, Wenhui Huang, Yan Liu, Mengyang Pu, Chao Deng, Junlan Feng