Min Li, Jing Sang, Yuanyao Lu, Lina Du
Weakly Supervised Video Anomaly Detection (WSVAD) is a critical task in computer vision that aims to localize and recognize abnormal behaviors using only video-level labels. Without frame-level annotations, modeling temporal dependencies becomes significantly challenging, and the diversity of abnormal events makes it equally difficult to model semantic representations. Recently, the cross-modal pre-trained model Contrastive Language-Image Pretraining (CLIP) has shown a strong ability to align visual and textual information, providing new opportunities for video anomaly detection. Inspired by CLIP, WSVAD-CLIP is proposed as a framework that leverages this cross-modal knowledge to bridge the semantic gap between text and vision. First, the Axial-Graph (AG) Module is introduced, combining an Axial Transformer with Lite Graph Attention Networks (LiteGAT) to capture global temporal structures and local abnormal correlations. Second, a Text Prompt mechanism is designed that fuses a learnable prompt with a knowledge-enhanced prompt to improve the semantic expressiveness of category embeddings. Third, the Abnormal Visual-Guided Text Prompt (AVGTP) mechanism is proposed to aggregate anomalous visual context and adaptively refine textual representations. Extensive experiments on the UCF-Crime and XD-Violence datasets show that WSVAD-CLIP notably outperforms existing methods in coarse-grained anomaly detection and also achieves superior performance in fine-grained anomaly recognition, validating its effectiveness and generalizability.
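To make the described components more concrete, below is a minimal PyTorch sketch of two of the three ideas named in the abstract, assuming 512-dimensional CLIP frame features. Every class name, shape, and operation here is an illustrative assumption, not the paper's implementation: plain temporal self-attention stands in for the Axial Transformer, a dense pairwise graph-attention pass stands in for LiteGAT, and a top-k cross-attention stands in for AVGTP.

```python
import torch
import torch.nn as nn


class AxialGraphModule(nn.Module):
    """Sketch of an Axial-Graph (AG) style block: temporal self-attention
    for global temporal structure, plus one graph-attention pass for local
    frame-to-frame abnormal correlations."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.edge_score = nn.Linear(2 * dim, 1)  # pairwise affinity scorer
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) CLIP frame features
        global_ctx, _ = self.temporal_attn(x, x, x)
        b, t, d = x.shape
        # Dense pairwise attention over frames; a real LiteGAT would be
        # sparser and cheaper than this O(t^2) illustration.
        xi = x.unsqueeze(2).expand(b, t, t, d)
        xj = x.unsqueeze(1).expand(b, t, t, d)
        adj = torch.softmax(
            self.edge_score(torch.cat([xi, xj], dim=-1)).squeeze(-1), dim=-1
        )
        local_ctx = torch.bmm(adj, self.proj(x))
        return x + global_ctx + local_ctx


class AbnormalVisualGuidedPrompt(nn.Module):
    """Sketch of AVGTP-style refinement: category/text embeddings attend to
    the top-k most anomalous frames and are updated with that context."""

    def __init__(self, dim: int, heads: int = 4, topk: int = 3):
        super().__init__()
        self.topk = topk
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, video, scores):
        # text: (batch, classes, dim), video: (batch, frames, dim),
        # scores: (batch, frames) frame-level anomaly scores
        idx = scores.topk(self.topk, dim=1).indices
        ctx = video.gather(1, idx.unsqueeze(-1).expand(-1, -1, video.size(-1)))
        refined, _ = self.cross_attn(text, ctx, ctx)
        return text + refined  # visually refined text embeddings


# Toy usage with random tensors in place of real CLIP features.
feats = torch.randn(2, 32, 512)   # 2 clips, 32 frames, 512-d features
text = torch.randn(2, 13, 512)    # e.g. 13 UCF-Crime anomaly categories
scores = torch.rand(2, 32)        # placeholder frame anomaly scores
temporal = AxialGraphModule(512)(feats)
prompts = AbnormalVisualGuidedPrompt(512)(text, feats, scores)
```

The sketch omits the Text Prompt fusion of learnable and knowledge-enhanced prompts, which depends on tokenizer and knowledge-source details the abstract does not specify.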