
Study Links Low‑Quality Training Data to Diminished Large Language Model Performance

Ars Technica

Background

Building on prior research linking excessive consumption of trivial online content to attention and memory problems in humans, a team of researchers from Texas A&M, the University of Texas, and Purdue University proposed that a comparable effect may occur in artificial intelligence. They call this the “LLM brain rot hypothesis”: the idea that continual exposure to low‑quality text can degrade a model’s cognitive abilities over time.

Methodology

The researchers compiled a corpus of 100 million tweets from a public HuggingFace dataset. To create a “junk” dataset, they selected tweets that combined high engagement metrics (likes, retweets, replies, and quotes) with short length, reasoning that such posts attract attention while offering little substantive content. A second junk‑identification approach used a GPT‑4o prompt to flag tweets with superficial content, such as conspiracy theories, exaggerated claims, unsupported assertions, and sensationalist clickbait language. A random sample of these GPT‑4o classifications was cross‑checked against labels from three graduate students, with a 76 percent match rate.
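As a rough sketch of the engagement‑plus‑brevity heuristic described above, the Python snippet below splits a tweet table into “junk” and “control” subsets. The column names, thresholds, and use of pandas are illustrative assumptions rather than details taken from the paper.

```python
import pandas as pd

# Illustrative thresholds; the study's actual cutoffs are not specified here.
MAX_JUNK_LENGTH = 30          # very short tweets, measured in words
MIN_JUNK_ENGAGEMENT = 500     # combined likes + retweets + replies + quotes

def split_junk(tweets: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split tweets into a high-engagement, short-length 'junk' set and the rest.

    Assumes columns: 'text', 'likes', 'retweets', 'replies', 'quotes'.
    """
    engagement = tweets[["likes", "retweets", "replies", "quotes"]].sum(axis=1)
    length = tweets["text"].str.split().str.len()

    is_junk = (engagement >= MIN_JUNK_ENGAGEMENT) & (length <= MAX_JUNK_LENGTH)
    return tweets[is_junk], tweets[~is_junk]

# Toy usage:
# junk, control = split_junk(pd.DataFrame({
#     "text": ["lol wut", "A long, substantive thread about methodology ..."],
#     "likes": [900, 12], "retweets": [300, 1], "replies": [150, 0], "quotes": [40, 0],
# }))
```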

Findings

The analysis demonstrates that it is feasible to distinguish between high‑engagement, low‑value text and more substantive content within a large tweet collection. The 76 percent concordance suggests that language models can reliably flag “junk” data when guided by targeted prompts. While the study does not yet quantify the exact performance decline in LLMs trained on the identified junk corpus, it establishes a framework for future experimentation on the hypothesized cognitive degradation.
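The 76 percent figure reads as simple percentage agreement between the model’s labels and the human annotators’ labels. A minimal sketch of that calculation, using invented labels rather than the study’s data, might look like this:

```python
def percent_agreement(model_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of items where the model's junk/not-junk label matches the human label."""
    assert len(model_labels) == len(human_labels) and model_labels, "need equal, non-empty lists"
    matches = sum(m == h for m, h in zip(model_labels, human_labels))
    return matches / len(model_labels)

# Toy example: 3 of 4 labels agree, giving 0.75 (the study reports roughly 0.76).
print(percent_agreement(["junk", "junk", "ok", "ok"],
                        ["junk", "ok",   "ok", "ok"]))
```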

Implications

If the brain‑rot hypothesis holds, AI developers may need to curate training datasets more carefully, avoiding over‑reliance on popular but shallow online content. The work also introduces a reproducible method for isolating low‑quality text, which could inform dataset‑cleaning pipelines and AI‑safety strategies. By linking human‑behavior research to machine‑learning practices, the paper encourages a broader discussion about the ethical and performance‑related consequences of data selection in AI development.


Source: Ars Technica
