In weakly supervised video anomaly detection (WSVAD), the temporal relationships among video snippets are crucial for modeling event patterns. The Transformer is commonly used to model such relationships, but the heavy redundancy in videos combined with the Transformer's quadratic complexity prevents it from effectively modeling long-range information. In addition, most WSVAD methods select key snippets based on predicted scores to represent event patterns, a paradigm that is susceptible to noise interference. To address these issues, we propose a novel temporal context and representative feature learning (TCRFL) method for WSVAD. Specifically, a temporal context learning (TCL) module combines Mamba, which has linear complexity, with the Transformer to capture both short-range and long-range dependencies of events. In addition, a representative feature learning (RFL) module mines representative snippets to capture important event information and then spreads this information to the video features, strengthening the influence of representative features. The RFL module not only suppresses noise interference but also guides the model to select key snippets more accurately. Experimental results on the UCF-Crime, XD-Violence, and ShanghaiTech datasets demonstrate the effectiveness and superiority of our method.
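The idea behind the RFL module can be illustrated with a minimal sketch. The code below is a hypothetical simplification, not the paper's implementation: it assumes representative snippets are those most similar to the video-level mean feature, selects the top-k of them, and blends their prototype back into every snippet feature to spread representative information; the function name, the similarity criterion, and the blending weight `alpha` are all illustrative assumptions.

```python
import numpy as np

def representative_feature_learning(feats, k=3, alpha=0.5):
    """Hypothetical RFL sketch.

    feats: (T, D) array of T snippet features of dimension D.
    Selects the k snippets most similar (cosine) to the video-level
    mean feature, averages them into a representative prototype, and
    blends that prototype into all snippet features.
    """
    mean = feats.mean(axis=0)                       # video-level mean feature
    norms = np.linalg.norm(feats, axis=1) * np.linalg.norm(mean) + 1e-8
    sim = feats @ mean / norms                      # cosine similarity per snippet
    topk = np.argsort(sim)[-k:]                     # indices of representative snippets
    proto = feats[topk].mean(axis=0)                # representative prototype
    # Spread the prototype to every snippet feature.
    return (1 - alpha) * feats + alpha * proto

# Usage: enhance 8 random snippet features of dimension 4.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
y = representative_feature_learning(x)
```

With `alpha=0` the features pass through unchanged; larger `alpha` pulls every snippet toward the representative prototype, which is one simple way to suppress the influence of noisy, unrepresentative snippets.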
Shengjun Peng, Yiheng Cai, Zijun Yao, Meiling Tan