Zero-Shot Industrial Anomaly Detection (ZSIAD) aims to identify and localize anomalies in industrial images from unseen categories. Owing to the powerful generalization capabilities, Vision-Language Models (VLMs) have achieved growing interest in ZSIAD. To guide the model toward understanding and localizing the semantically complex industrial anomalies, existing VLM-based methods have attempted to provide additional prompts to the model through learnable text prompt templates. However, these zero-shot methods lack detailed descriptions of specific anomalies, making it difficult to classify and segment the diverse range of industrial anomalies accurately. To address the aforementioned issue, we firstly propose the multi-stage prompt generation agent for ZSIAD. Specifically, we leverage the Multi-modal Language Large Model (MLLM) to articulate the detailed differential information between normal and test samples, which can provide detailed text prompts to the model through further refinement and anti-false alarm constraint. Moreover, we introduce the Visual Fundamental Model (VFM) to generate anomaly-related attention prompts for more accurate localization of anomalies with varying sizes and shapes. Extensive experiments on seven real-world industrial anomaly detection datasets have shown that the proposed method not only outperforms recent SOTA methods, but also its explainable prompts provide the model with a more intuitive basis for anomaly identification.
Zhen QuXian TaoXinyi GongS. Q. QuQiyu ChenZhengtao ZhangXingang WangGuiguang Ding
Soyoung ParkHyewon LeeMin-Gyu ChoiSeunghoon HanJong-Ryul LeeSungsu LimTaeho Kim
Qinqian LeiRobby T. TanBo Wang
Er JinQihui FengYongli MouGerhard LakemeyerStefan DeckerOliver SimonsJohannes Stegmaier