With the prevalent use of deep neural networks (DNNs), concerns have been raised about security threats to DNNs, such as backdoors in the network. While neural network repair methods have been shown to be effective for fixing defects in DNNs, they have also been found to produce biased models, with imbalanced accuracy across different classes, or models with weakened adversarial robustness, which allows malicious attackers to trick the model by adding small perturbations. To address these challenges, we propose INNER, an INterpretability-based NEural Repair approach. INNER formulates the idea of neuron routing for identifying fault neurons, in which an interpretability technique, the model probe, is used to evaluate each neuron’s contribution to the undesired behaviour of the neural network. INNER then optimizes the identified neurons to repair the neural network. We evaluate INNER on three typical application scenarios: backdoor attacks, adversarial attacks, and wrong predictions. Our experimental results demonstrate that INNER can effectively repair neural networks while ensuring accuracy, fairness, and robustness. Moreover, the performance of other repair methods can also be improved by re-using the fault neurons found by INNER, demonstrating the generality of the proposed approach.
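To make the idea of probe-based neuron scoring concrete, the following is a minimal sketch and not the INNER implementation. It assumes that a simple logistic-regression probe is fitted on the activations of one hidden layer to predict whether an input triggers the undesired behaviour, and that the magnitude of each probe weight serves as a proxy for that neuron's contribution. The function name `rank_neurons_by_probe`, the synthetic activations, and the planted neuron index are all hypothetical and chosen only for illustration.

```python
import numpy as np

def rank_neurons_by_probe(acts, bad, steps=500, lr=0.1):
    """Fit a logistic-regression probe on activations and rank neurons by |weight|.

    acts: (n_samples, n_neurons) activations of one layer (assumed given).
    bad:  (n_samples,) 1.0 if the sample shows the undesired behaviour, else 0.0.
    """
    # Standardize each neuron so that weight magnitudes are comparable.
    mu = acts.mean(axis=0)
    sigma = acts.std(axis=0) + 1e-8
    x = (acts - mu) / sigma

    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # probe prediction
        grad = p - bad                          # gradient of the logistic loss
        w -= lr * (x.T @ grad) / len(bad)
        b -= lr * grad.mean()

    scores = np.abs(w)                          # proxy for each neuron's contribution
    order = np.argsort(scores)[::-1]            # highest-scoring neurons first
    return order, scores

# Synthetic data (an assumption for illustration): neuron 3 responds to "bad" inputs.
rng = np.random.default_rng(0)
n, d = 400, 8
bad = (rng.random(n) < 0.5).astype(float)
acts = rng.normal(size=(n, d))
acts[:, 3] += 3.0 * bad

order, scores = rank_neurons_by_probe(acts, bad)
top = int(order[0])
print("highest-scoring neuron:", top)
```

In a repair setting, only the highest-ranked neurons would then be passed to the optimization step, leaving the remaining parameters untouched; how INNER performs that optimization is not specified here.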