In this paper, we propose a novel method for detecting adversarial examples by training a binary classifier on both the original data and saliency data. For an image classification model, a saliency map explains how the model makes its decisions by identifying the pixels most significant for the prediction. A model that produces a wrong classification output has learned wrong features and shows wrong saliency as well. Our approach performs well at detecting adversarial perturbations. We quantitatively evaluate the generalization ability of the detector, showing that detectors trained against strong adversaries also perform well against weak adversaries.
Chiliang Zhang, Zuochang Ye, Yan Wang, Zhimou Yang
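The abstract stays at a high level, so the following is a minimal sketch of how such a detector could be wired up, assuming PyTorch. The gradient-based saliency definition (in the style of Simonyan et al.), the FGSM attack used to emulate "strong" vs. "weak" adversaries via its step size, and the `SaliencyDetector` architecture are all illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def saliency_map(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Per-pixel saliency: |d(top-class score)/d(pixel)|, max over channels.

    Gradient-based saliency in the style of Simonyan et al.; an assumption,
    as the paper does not pin down the saliency method here.
    """
    x = x.clone().detach().requires_grad_(True)
    model(x).max(dim=1).values.sum().backward()
    return x.grad.abs().amax(dim=1, keepdim=True)  # shape (N, 1, H, W)

def fgsm(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
         eps: float) -> torch.Tensor:
    """FGSM attack; a larger eps plays the role of a 'stronger' adversary."""
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

class SaliencyDetector(nn.Module):
    """Hypothetical binary detector fed the image concatenated with its
    saliency map; outputs logits for clean (0) vs. adversarial (1)."""
    def __init__(self, in_channels: int = 4):  # e.g. 3 RGB + 1 saliency
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 2),
        )

    def forward(self, x: torch.Tensor, sal: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, sal], dim=1))
```

Under these assumptions, training pairs would be `(x, saliency_map(f, x))` labeled clean and `(fgsm(f, x, y, eps), saliency_map(f, x_adv))` labeled adversarial, with the generalization claim tested by training the detector at a large `eps` and evaluating it at smaller ones.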