Illustration for: Study shows LLMs can be poisoned with under 250 samples, far below 1% threshold
LLMs & Generative AI

Study shows LLMs can be poisoned with under 250 samples, far below 1% threshold

3 min read

It seems a recent paper is shaking a belief many of us have held about poisoning large language models. The authors, in *Poisoning Attacks on LLMs: A Direct Attack on LLMs with Less than 250 Samples*, report that fewer than 250 malicious records can sway a model, a number far below the old “1 % of the training set” rule of thumb. In a corpus of 10 million examples, we used to think you’d need to corrupt about 100 000 entries before anything noticeable happened.

The experiments suggest the opposite: a tiny, carefully chosen batch can nudge outputs in the wrong direction. The attack works by slipping misleading or harmful data into the training pipeline, so the model picks up the poison’s intent. This raises a practical worry for anyone who depends on huge, outsourced datasets; the barrier to sabotage might be much lower than industry practice assumes.

I’m left wondering how many pipelines are truly vetted for such tiny threats.

Researchers previously believed that corrupting just 1% of a large language model’s training data would be enough to poison it. Poisoning happens when attackers introduce malicious or misleading data that changes how the model behaves or responds. For example, in a dataset of 10 million records, they assumed about 100,000 corrupted entries would be sufficient to compromise the LLM.

According to these results, regardless of the size of the model and training data, experimental setups with simple backdoors designed to provoke low-stakes behaviors and poisoning attacks require a nearly constant amount of documents. The current assumption that bigger models need proportionally more contaminated data is called into question by this finding. In particular, attackers can successfully backdoor LLMs with 600M to 13B parameters by inserting only 250 malicious documents into pretraining data.

Instead of injecting a proportion of training data, attackers just need to insert a predetermined, limited number of documents.

Related Topics: #LLMs #poisoning attacks #training data #malicious records #backdoor #parameters #dataset #OpenAI #GPT-5 #AI

So what does this mean for AI safety? The paper shows that under 250 malicious documents can slip a backdoor into a large language model, and it seems to work regardless of the model’s size or how much data it was trained on. Until now most of us thought you needed to corrupt about one percent of the training set - often millions of records - to see anything similar.

This new result hints that a carefully chosen few samples might be enough. That’s a bit unsettling, because in real-world pipelines we rarely have the resources to check every single contribution. The authors, however, don’t spell out how the backdoor behaves on different downstream tasks, nor do they give a clear picture of how hard it is to spot or fix the implant after the model is deployed.

Because of that, it’s still fuzzy whether our current defenses can be tweaked to handle this lower-sample threat. I guess more work is needed to gauge the real-world impact and to shape training and monitoring practices that can stand up to such attacks.

Common Questions Answered

How many malicious samples are now shown to be sufficient for poisoning LLMs according to the new study?

The study demonstrates that fewer than 250 malicious records can effectively poison a large language model. This figure is dramatically lower than the previously assumed 1% threshold, which could have meant millions of records for large datasets.

What was the previously cited rule of thumb for the amount of data needed to poison an LLM?

Researchers previously believed that corrupting approximately 1% of a large language model's training data was required to poison it. For a dataset of 10 million records, this would have equated to about 100,000 corrupted entries being considered sufficient.

Why is the finding that poisoning is independent of model scale significant for AI safety?

This finding is significant because it shows that a backdoor can be implanted regardless of the model's size or the size of its training corpus. It raises major concerns about the feasibility of defending against such attacks, as even very large models are vulnerable to a small number of carefully crafted malicious samples.

What does the study reveal about the relationship between the number of poisoning samples and the training dataset size?

The study upends the assumption that the number of poisoning samples needed is proportional to the dataset size. Instead, it shows that fewer than 250 samples can be effective, independent of whether the training corpus contains millions or billions of records.