Science Journalists Find ChatGPT Struggles With Accurate Summaries
Evaluation of ChatGPT‑Generated Summaries
Science journalists tasked with assessing ChatGPT’s ability to distill scientific articles reported consistently low scores across several criteria. When asked whether the AI‑produced summaries could blend seamlessly into their existing lineup of press briefings, evaluators assigned an average rating of 2.26 on a five‑point scale, where 1 means “no, not at all” and 5 means “absolutely.” When asked how compelling the briefs were, the average score dropped slightly to 2.14. Only a single summary earned the top rating of 5 on either metric, while 30 received the lowest rating of 1.
Qualitative feedback highlighted recurring problems. Reviewers noted that ChatGPT frequently conflated correlation with causation, left out essential background (such as the typical slowness of soft actuators), and tended to overhype results, sprinkling in buzzwords like “groundbreaking” and “novel.” Prompting the model to avoid such language reduced the hype, but the other issues persisted.
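For readers curious what that mitigation looks like in practice, the sketch below shows one way to phrase such an instruction through the OpenAI Python SDK. This is an illustration only, not the AAAS team’s actual workflow: the study used ChatGPT directly, and the banned‑word list, prompt wording, and model name here are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

paper_text = "..."  # full text or abstract of the paper to summarize

# Words the reviewers flagged as over-hype; the exact list is illustrative.
BANNED_WORDS = ["groundbreaking", "novel", "breakthrough"]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumption; the study used the ChatGPT web interface
    messages=[
        {
            "role": "system",
            "content": (
                "You summarize scientific papers for a press briefing. "
                "Avoid promotional buzzwords such as "
                + ", ".join(BANNED_WORDS)
                + ". Do not state causation unless the paper demonstrates it."
            ),
        },
        {"role": "user", "content": paper_text},
    ],
)

print(response.choices[0].message.content)
```

As the journalists found, an instruction like this can curb promotional language, but it does not by itself fix deeper problems such as conflating correlation with causation or omitting context.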
Limitations in Depth and Accuracy
The journalists observed that ChatGPT is adept at restating the literal text of a paper when the source material lacks nuance. The model struggles, however, to place those findings in a broader context, failing to discuss methodology, limitations, or larger implications. This weakness becomes especially apparent when a paper presents multiple, sometimes conflicting, results, or when the model is asked to merge two related studies into a single brief.
Fact‑checking emerged as a major concern. Reporters described the need for “extensive fact‑checking” to verify AI‑generated content, noting that using ChatGPT as a starting point could require as much effort as writing a summary from scratch. The journalists emphasized that scientific communication demands precision and clarity, making any lapse in factual reliability unacceptable.
Implications for Scientific Publishing
Overall, the AAAS journalists concluded that the current version of ChatGPT does not meet the style and standards required for scientific briefs in their press package. While they acknowledged that future major updates to the model might improve performance, they recommended a cautious approach and stressed the importance of human oversight. The study adds to a broader body of research showing that AI tools can cite incorrect sources as often as 60 percent of the time, reinforcing the need for rigorous editorial review when integrating AI‑generated text into scientific discourse.