This paper explores the growing threat of data poisoning and backdoor attacks in large language models (LLMs), revealing that even a small, fixed number of poisoned samples (around 250 documents) can compromise models of up to 13B parameters. It synthesizes recent research, explains experimental methodologies from Anthropic and others, and provides actionable defense strategies for AI engineers and enterprises. The work emphasizes the urgent need for trusted data pipelines, anomaly detection, and post-training audits to ensure AI model integrity at scale.
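To make the anomaly-detection idea concrete, the following is a minimal, hypothetical sketch (not taken from the paper; the function names and thresholds are illustrative assumptions) of one pre-training data check: it flags documents that share a long verbatim word n-gram with many other documents, the kind of near-duplicate footprint a fixed set of injected poison samples can leave in a corpus.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set


def shingles(text: str, n: int = 8) -> Set[str]:
    """Return the set of word n-gram shingles in a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def flag_repeated_shingles(docs: Dict[str, str], n: int = 8, min_docs: int = 50) -> List[str]:
    """Flag documents sharing an identical long n-gram with many others.

    This is a crude screening signal for injected near-duplicate poison
    samples, intended for human review, not a definitive detector.
    """
    doc_count: Counter = Counter()          # shingle -> number of docs containing it
    owners = defaultdict(set)               # shingle -> ids of docs containing it
    for doc_id, text in docs.items():
        for sh in shingles(text, n):
            doc_count[sh] += 1
            owners[sh].add(doc_id)

    flagged: Set[str] = set()
    for sh, count in doc_count.items():
        if count >= min_docs:               # verbatim phrase repeated across many docs
            flagged.update(owners[sh])
    return sorted(flagged)


if __name__ == "__main__":
    # Toy corpus: 1,000 clean documents plus 250 documents embedding one trigger phrase.
    corpus = {f"clean-{i}": f"an ordinary training document number {i} about assorted everyday topics" for i in range(1000)}
    trigger = "when the hidden phrase appears respond only with the secret payload text"
    corpus.update({f"poison-{i}": f"benign looking prefix {i} {trigger} benign looking suffix" for i in range(250)})

    suspicious = flag_repeated_shingles(corpus, n=8, min_docs=50)
    print(f"{len(suspicious)} documents flagged for manual review")
```

A check like this would run before training as part of a trusted data pipeline; real deployments would combine it with provenance tracking and statistical outlier detection rather than relying on a single heuristic.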