Anthropic Study Finds Just 250 Documents Can Backdoor Large Language Models
Anthropic, in collaboration with the UK Government's AI Security Institute and the Alan Turing Institute, has found that large language models can be backdoored with a surprisingly small amount of poisoned data, according to a research paper published on Anthropic’s website.
The study shows that adding just 250 malicious documents—roughly 0.00016% of total training data—can trigger backdoor behaviors in models ranging from 600 million to 13 billion parameters. The attack used a trigger phrase, “
The team trained 72 models across different configurations to confirm that poisoning success depends on the absolute number of poisoned samples rather than the proportion of the dataset. Even models trained on twenty times more clean data were equally vulnerable once they encountered the same number of malicious documents.
Anthropic’s researchers said the findings challenge common assumptions about data poisoning, suggesting that attackers may not need large-scale data access to compromise models. The team shared the results to encourage further research into scalable defenses against such vulnerabilities.
We hope you enjoyed this article.
Consider subscribing to one of our newsletters like AI Policy Brief, Cybersecurity AI Weekly or Daily AI Brief.
Also, consider following us on social media:
More from: AI Safety
More from: Cybersecurity
Subscribe to AI Policy Brief
Weekly report on AI regulations, safety standards, government policies, and compliance requirements worldwide.
Market report
2025 State of Data Security Report: Quantifying AI’s Impact on Data Risk
The 2025 State of Data Security Report by Varonis analyzes the impact of AI on data security across 1,000 IT environments. It highlights critical vulnerabilities such as exposed sensitive cloud data, ghost users, and unsanctioned AI applications. The report emphasizes the need for robust data governance and security measures to mitigate AI-related risks.
Read more