Anomalous event detection in real-world video scenes is a challenging problem due to the complexity of "anomaly" as a concept as well as the cluttered backgrounds, objects, and motions in the scenes. Most existing methods use hand-crafted features in local spatial regions to identify anomalies. In this paper, we propose a novel model called Spatio-Temporal AutoEncoder (ST AutoEncoder or STAE), which uses deep neural networks to learn video representations automatically and extracts features from both the spatial and temporal dimensions by performing 3-dimensional convolutions. In addition to the reconstruction loss used in typical existing autoencoders, we introduce a weight-decreasing prediction loss for generating future frames, which enhances the learning of motion features in videos. Since most anomaly detection datasets are restricted to appearance anomalies or unnatural motion anomalies, we collected a new, challenging dataset comprising a set of real-world traffic surveillance videos. Experiments on both public benchmarks and our traffic dataset show that the proposed method remarkably outperforms state-of-the-art approaches.
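To make the combined objective concrete, the following is a minimal sketch of a reconstruction loss plus a weight-decreasing prediction loss. The abstract does not give the exact weighting scheme, so the linearly decreasing, normalized weights used here are an assumption, as are the names `frame_mse`, `decreasing_weights`, `combined_loss`, and the `pred_coeff` balancing parameter. Frames are represented as flat lists of pixel values purely for illustration.

```python
from typing import List, Sequence

Frame = Sequence[float]  # hypothetical representation: one flattened frame


def frame_mse(a: Frame, b: Frame) -> float:
    """Mean squared error between two flattened frames of equal size."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def decreasing_weights(num_frames: int) -> List[float]:
    """Linearly decreasing weights that sum to 1.

    Assumed schedule: the nearest future frame gets the largest weight,
    the farthest gets the smallest. The paper's schedule may differ.
    """
    raw = [float(num_frames - t) for t in range(num_frames)]
    total = sum(raw)
    return [w / total for w in raw]


def combined_loss(
    inputs: Sequence[Frame],
    reconstructed: Sequence[Frame],
    future: Sequence[Frame],
    predicted: Sequence[Frame],
    pred_coeff: float = 1.0,  # assumed balance between the two terms
) -> float:
    """Reconstruction loss plus a weight-decreasing prediction loss."""
    # Reconstruction term: average error over the input clip.
    recon = sum(frame_mse(x, r) for x, r in zip(inputs, reconstructed)) / len(inputs)
    # Prediction term: errors on later future frames are down-weighted.
    weights = decreasing_weights(len(future))
    pred = sum(w * frame_mse(f, p) for w, f, p in zip(weights, future, predicted))
    return recon + pred_coeff * pred
```

Under this assumed schedule, an error on the first predicted frame costs more than the same error on a later one, reflecting the idea that near-future frames are easier to predict and should be held to a tighter standard.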
Yunpeng Chang, Zhigang Tu, Wei Xie, Bin Luo, Shifu Zhang, Haigang Sui, Junsong Yuan
Yuanhong Zhong, Xia Chen, Jinyang Jiang, Fan Ren
Yunlong Wang, Mingyi Chen, Jiaxin Li, Hongjun Li
Yuanyuan Li, Yiheng Cai, Jiaqi Liu, Shinan Lang, Xinfeng Zhang
Xiaohu Sun, Jinyi Chen, Xulin Shen, Hongjun Li