Researchers found that large language models can acquire backdoor behaviors after exposure to only a handful of malicious documents. Experiments with GPT-3.5-turbo and other models demonstrated high attack success rates when as few as 50 to 90 malicious examples were present, regardless of overall dataset size. The study also highlighted that simple safety‑training with a few hundred clean examples can significantly weaken or eliminate the backdoor. Limitations include testing only models up to 13 billion parameters and focusing on simple triggers, while real‑world models are larger and training pipelines more guarded. The findings call for stronger data‑poisoning defenses.
Read more →