Guoqian Shang, Chao Huang, Jingyong Su, Yong Xu
Video anomaly detection (VAD) is commonly formulated as the discrimination of events that do not conform to the regular patterns in videos. Recently, deep neural network-based VAD approaches have made remarkable progress. Existing unsupervised approaches usually achieve VAD by frame reconstruction or prediction, and then identify anomalies according to the reconstruction or prediction errors. However, these approaches suffer from two limitations: (1) they cannot obtain the semantic features of normal training samples, and (2) they are suboptimal because of the misalignment between the proxy and actual tasks. To address these issues, we present a novel temporal-aware self-supervised learning framework that obtains high-level semantic features and performs VAD by solving multiple pretext tasks. In particular, we utilize temporal transformations to form multiple pretext tasks (transformation prediction) for VAD. A 3D encoder is trained to obtain semantic features by jointly solving these pretext tasks, and multi-task heads then utilize these features to solve the different pretext tasks. In the inference phase, the multiple task losses are combined to calculate the final anomaly score. Extensive experiments on two benchmarks show that the proposed method outperforms state-of-the-art approaches.
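The abstract's inference step, combining the losses of several pretext-task heads into one anomaly score, can be sketched minimally as follows. All function names and the aggregation rule (a simple mean of per-head cross-entropy losses) are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch: score a clip by aggregating the losses of multiple
# self-supervised pretext-task heads (each predicting which temporal
# transformation was applied). Higher score = more anomalous.
import math

def cross_entropy(probs, true_idx):
    """Cross-entropy loss for one head's softmax output."""
    return -math.log(probs[true_idx])

def anomaly_score(head_outputs):
    """Mean of per-task losses over all pretext-task heads.

    head_outputs: list of (softmax_probs, applied_transform_idx) pairs,
    one pair per head; the aggregation by mean is an assumption.
    """
    losses = [cross_entropy(p, t) for p, t in head_outputs]
    return sum(losses) / len(losses)

# A normal clip: heads confidently recover the applied transformations.
normal = anomaly_score([([0.9, 0.05, 0.05], 0), ([0.05, 0.9, 0.05], 1)])
# An anomalous clip: heads are confused, so the losses are larger.
abnormal = anomaly_score([([0.4, 0.3, 0.3], 0), ([0.35, 0.4, 0.25], 1)])
```

Under this scheme, clips from regular patterns seen in training yield low task losses, so thresholding the aggregated score separates them from anomalies.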