Feature attribution methods aim to explain the predictions of deep learning models by identifying the input features that are most relevant to the model's decision. However, these methods are often sensitive to small perturbations in the input, leading to unstable and unreliable explanations, especially when models are deployed in real-world scenarios where robustness is paramount. This paper investigates the limitations of current feature attribution techniques in the context of robust image classification. We propose a novel approach that integrates adversarial training with feature attribution to generate more robust and faithful explanations. Our method, termed "Robust Attribution through Adversarial Perturbation" (RAAP), leverages adversarial examples to identify and mitigate attribution biases. We evaluate RAAP on several benchmark datasets and demonstrate that it produces feature attributions that are both more stable under input perturbations and more aligned with human perception. Furthermore, we show that RAAP can be used to improve the robustness of image classification models by identifying and correcting spurious correlations learned during training. Our results highlight the importance of considering robustness when evaluating and deploying feature attribution methods in safety-critical applications.
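The abstract does not spell out RAAP's procedure, but the stability criterion it evaluates can be made concrete. Below is a minimal PyTorch sketch that measures how much a gradient-based attribution changes under a one-step adversarial (FGSM) perturbation; the choice of input-gradient saliency and the FGSM attack are illustrative assumptions, not the paper's actual method.

```python
import torch
import torch.nn.functional as F

def saliency(model, x, y):
    # Input-gradient attribution: gradient of the true-class logit w.r.t. x.
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)
    logits[torch.arange(len(y)), y].sum().backward()
    return x.grad.detach()

def fgsm(model, x, y, eps=0.03):
    # One-step FGSM perturbation (assumed attack; RAAP's may differ).
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).detach()

def attribution_stability(model, x, y, eps=0.03):
    # Cosine similarity between clean and perturbed attributions;
    # values near 1 indicate explanations that are stable under perturbation.
    a_clean = saliency(model, x, y).flatten(1)
    a_adv = saliency(model, fgsm(model, x, y, eps), y).flatten(1)
    return F.cosine_similarity(a_clean, a_adv, dim=1)
```

A low average similarity on a held-out batch would flag the kind of attribution instability the abstract describes; the paper's evaluation may use a different attribution method or perturbation model.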
Jie Lei, Guoyu Yang, Shuaiwei Wang, Zunlei Feng, Ronghua Liang
Weitao Wan, Yuanyi Zhong, Tianpeng Li, Jiansheng Chen
Liang Ye, Shuai Lu, Rui Weng, Chengzhe Han, Ming Liu