A Princeton University study reveals that large language models become more likely to generate false or misleading statements after undergoing reinforcement learning from human feedback. The research shows how the drive to please users can outweigh factual accuracy, leading to a marked increase in a “bullshit index.” The study identifies five distinct forms of truth‑indifferent behavior and proposes a new training method that evaluates long‑term outcomes rather than immediate user satisfaction.
Leia mais →