Video anomaly detection is an essential and challenging task in the computer vision community, which aims to automatically detect and localize abnormal events in videos. In this paper, we propose an attention-augmented spatial-temporal normality learning framework to explore the distinct spatial and temporal patterns of normal events. Specifically, we first slice the videos into local spatial-temporal cubes along the spatial and temporal dimensions to facilitate independent learning of the prototypical spatial and temporal patterns of normal videos. In the training phase, we use parallel deep convolutional neural networks to learn the spatial features of each cube and introduce an attention module to guide the model to focus on the important local cubes. Then, to exploit the complementary information of adjacent video fragments in the temporal dimension, we use a convolutional long short-term memory (ConvLSTM) network to model temporal patterns. In the testing phase, we calculate the prediction errors of the salient areas and compute the anomaly score by measuring the difference between the testing samples and the learned spatial-temporal normality. Experimental results on standard benchmarks show that the proposed method achieves performance comparable to state-of-the-art methods, with frame-level AUCs of 96.6%, 85.2%, and 68.8% on UCSD Ped2, CUHK Avenue, and ShanghaiTech, respectively.
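The cube-slicing and error-based scoring described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the cube dimensions (`cube_t`, `cube_h`, `cube_w`) and the min-max normalization of the per-frame error are common conventions assumed here, and the learned networks are abstracted away.

```python
import numpy as np

def slice_into_cubes(volume, cube_t=4, cube_h=16, cube_w=16):
    """Split a video volume of shape (T, H, W) into non-overlapping
    local spatial-temporal cubes. Cube sizes are illustrative
    assumptions, not values from the paper."""
    T, H, W = volume.shape
    cubes = []
    for t in range(0, T - cube_t + 1, cube_t):
        for y in range(0, H - cube_h + 1, cube_h):
            for x in range(0, W - cube_w + 1, cube_w):
                cubes.append(volume[t:t + cube_t, y:y + cube_h, x:x + cube_w])
    return np.stack(cubes)  # (num_cubes, cube_t, cube_h, cube_w)

def anomaly_score(pred, target, eps=1e-8):
    """Per-frame anomaly score from prediction error: mean squared
    error per frame, min-max normalized over the sequence (a common
    scoring convention; the paper's exact formula may differ)."""
    err = np.mean((pred - target) ** 2, axis=(1, 2))  # MSE per frame
    return (err - err.min()) / (err.max() - err.min() + eps)
```

At test time, frames whose normalized error is high relative to the learned normality would be flagged as anomalous.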
Haoyang Chen, Xue Mei, Zhiyuan Ma, Xinhong Wu, Yachuan Wei
Yang Liu, Jing Liu, Xiaoguang Zhu, Donglai Wei, Xiaohong Huang, Liang Song
Zhangxun Li, Mengyang Zhao, Xinhua Zeng, Tian Wang, Chengxin Pang
Yutong Chen, Hongzuo Xu, Guansong Pang, Hezhe Qiao, Yuan Zhou, Mingsheng Shang
Liheng Shen, Tetsu Matsukawa, Einoshin Suzuki