Atrás

Anthropic Study Shows Tiny Data Poisoning Can Backdoor Large Language Models

Anthropic Study Shows Tiny Data Poisoning Can Backdoor Large Language Models
Engadget

Background

Artificial intelligence companies have been racing to develop increasingly powerful tools, but rapid progress has not always been matched by a clear understanding of AI’s limitations and vulnerabilities. In this context, Anthropic released a new report focusing on the risk of data‑poisoning attacks against large language models (LLMs).

Study Focus and Methodology

The study centered on a type of attack known as poisoning, where an LLM is pretrained on malicious content intended to teach it dangerous or unwanted behaviors. Researchers examined how many malicious documents would be needed to embed a backdoor into models of various sizes.

Key Findings

Anthropic’s experiments showed that a small, fairly constant number of malicious documents can poison an LLM, regardless of the model’s size or the total volume of training data. Specifically, the team successfully backdoored LLMs using only 250 malicious documents in the pretraining dataset. This number is far smaller than expected for models ranging from 600 million to 13 billion parameters.

Implications and Reactions

The results suggest that data‑poisoning attacks might be more practical and accessible to adversaries than previously believed. Anthropic emphasized the importance of sharing these findings to encourage further research on detection and mitigation strategies.

Collaboration and Future Work

The research was conducted in partnership with the UK AI Security Institute and the Alan Turing Institute. The collaborators plan to continue exploring defenses against data‑poisoning and to raise awareness of the security challenges inherent in LLM development.

Usado: News Factory APP - descubrimiento de noticias y automatización - ChatGPT para Empresas

Source: Engadget

También disponible en: