This work investigates personal data privacy risks in the NLP domain, together with other attacks that threaten privacy or effective learning, including private information re-identification, membership inference, and data poisoning. Through a survey of the literature, a comprehensive set of privacy-threatening attacks on NLP models is identified, along with the leading ideas for reducing their impact. This work demonstrates the risk that specific attacks pose to popular NLP models and carries out a rigorous empirical evaluation of the impact of the proposed mitigation strategies. The thesis proposes a set of privacy-preserving defences for machine learning in the natural language domain, with a particular focus on large pre-trained language models. These include several approaches based on local differential privacy (LDP), that is, transforming the data before it is processed so that breaching privacy becomes more difficult while the data's utility for learning is preserved, as well as approaches based on adversarial training, such as Gradient Reversal and Cross-Gradient Training. In addition, this research develops and empirically demonstrates the effectiveness of a hybrid LDP/adversarial approach for reducing re-identification risk in language models, together with similar hybrid and combined approaches for reducing membership inference attack risk.
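To make the local differential privacy idea concrete, the following is a minimal sketch of a word-level metric-LDP mechanism in the style of Feyisetan et al.: each word's embedding is perturbed with noise calibrated to the privacy parameter, then mapped back to the nearest vocabulary word. The function and parameter names (`perturb_word`, `vocab`, `embeddings`, `epsilon`) are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def perturb_word(word, vocab, embeddings, epsilon, rng=None):
    """Replace `word` with a nearby word under metric local differential privacy.

    Adds noise drawn from a multivariate Laplace-style distribution to the
    word's embedding, then snaps the noisy vector back to the closest word
    in the vocabulary.
    """
    rng = rng or np.random.default_rng()
    v = embeddings[vocab.index(word)]
    d = v.shape[0]
    # Sample a uniformly random direction and a Gamma(d, 1/epsilon) radius,
    # giving a noise density proportional to exp(-epsilon * ||z||).
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    radius = rng.gamma(shape=d, scale=1.0 / epsilon)
    noisy = v + radius * direction
    # Post-process: return the vocabulary word nearest to the noisy vector.
    dists = np.linalg.norm(embeddings - noisy, axis=1)
    return vocab[int(np.argmin(dists))]
```

Smaller values of `epsilon` produce larger perturbations, so the returned word is more likely to differ from the input, trading utility for privacy.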
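The Gradient Reversal technique named above is commonly implemented as a layer that acts as the identity on the forward pass and negates (and scales) gradients on the backward pass, following Ganin and Lempitsky's domain-adversarial setup. A minimal PyTorch sketch of that standard construction follows; the class name `GradientReversal` and the scaling factor `lambda_` are illustrative choices, not the thesis's code.

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Identity on the forward pass; reverses gradients on the backward pass.

    Placed between a shared encoder and an adversarial classifier (e.g. one
    predicting a private attribute), it trains the encoder to remove the
    information the adversary relies on.
    """

    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Negate and scale the gradient flowing back into the encoder.
        return -ctx.lambda_ * grad_output, None

def grad_reverse(x, lambda_=1.0):
    return GradientReversal.apply(x, lambda_)
```

In a privacy setting, the reversed gradients push the encoder toward representations from which the adversarial classifier cannot recover private attributes, which is the mechanism underlying the adversarial defences summarised above.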