
New York Times Study Finds Google AI Overviews Miss One in Ten Answers

Google’s AI Overviews, the Gemini‑driven answer boxes that sit at the top of search results, have drawn criticism since their debut in 2024. The New York Times teamed up with Oumi, a startup that builds AI models, to put the feature through a systematic accuracy test. Using the SimpleQA benchmark – a set of more than 4,000 verifiable questions released by OpenAI – the researchers found the Overviews answered correctly 91 percent of the time.

The 9 percent error rate may sound modest, but extrapolated across Google's billions of daily searches, it implies hundreds of thousands of incorrect answers delivered every minute. Oumi first ran the test last year, while Gemini 2.5 was still Google's flagship model; at that point the benchmark showed an 85 percent success rate. After the rollout of Gemini 3, accuracy rose to 91 percent, a modest gain that still leaves a substantial volume of misinformation reaching searchers.
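The scale of that extrapolation can be sanity-checked with a rough calculation. This is a back-of-envelope sketch, not figures from the study: the daily search volume is an assumed round number, and it assumes every search surfaces an AI Overview, which overstates the real exposure.

```python
# Back-of-envelope check of the "hundreds of thousands of errors per minute" claim.
# ASSUMPTION: ~8.5 billion Google searches per day (commonly cited estimate,
# not from the article) and an AI Overview on every search.
DAILY_SEARCHES = 8.5e9      # assumed daily search volume
ERROR_RATE = 0.09           # 9 percent error rate reported by the study
MINUTES_PER_DAY = 24 * 60

errors_per_minute = DAILY_SEARCHES * ERROR_RATE / MINUTES_PER_DAY
print(f"{errors_per_minute:,.0f} incorrect answers per minute")
```

Under these assumptions the figure lands around half a million per minute, consistent with the article's "hundreds of thousands" framing; with Overviews appearing on only a fraction of searches, the true number would be proportionally lower.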

Specific failures illustrate the problem. When asked for the date Bob Marley’s former home became a museum, the Overviews cited three sources, two of which omitted the date entirely. The third source, Wikipedia, listed two conflicting years, and the AI confidently selected the wrong one. In another case, the system was queried about Yo‑Yo Ma’s induction into the Classical Music Hall of Fame. While the organization’s website confirmed the induction, the Overviews claimed the Hall of Fame didn’t exist.

Google acknowledges that AI Overviews are still learning. The company has rolled out updates aimed at improving factuality, but the New York Times report suggests the feature’s current performance falls short of the high bar users expect from a search giant. Critics argue that even a small error percentage can erode trust when the answers appear in a prominent, “instant” format.

Oumi’s involvement adds an extra layer of credibility. As a developer of generative AI tools, the startup has a vested interest in accurate benchmarking. Its methodology involved feeding the SimpleQA questions to the Overviews and manually verifying the cited sources. The study stops short of publishing the full list of erroneous responses, however, citing the sheer volume of data.

Google has not yet commented publicly on the New York Times findings. Industry observers note that the company’s next steps will likely involve tighter source verification and perhaps a flagging system for uncertain answers. For now, the research underscores a reality: as AI-generated content becomes more visible, its imperfections become more consequential.


Source: Ars Technica
