
Study Suggests Even the Best AI Models Have Hallucinations

All generative AI models hallucinate, from Google’s Gemini to Anthropic’s Claude to OpenAI’s latest stealth version of GPT-4o. In other words, the models are unreliable narrators, sometimes to hilarious effect, other times problematically.

But not all models invent things at the same rate. And the type of falsehoods they spew depends on the sources of information they have been exposed to.

A recent study by researchers at Cornell University, the Universities of Washington and Waterloo, and the nonprofit research institute AI2 attempted to compare hallucination rates by fact-checking the output of models like GPT-4o against authoritative sources on topics ranging from law and health to history and geography. They found that no model performed exceptionally well across all topics, and that the models that hallucinated the least did so partly because they refused to answer questions they would otherwise have gotten wrong.

“The most important conclusion from our work is that we still can’t fully trust the outputs of model generations,” Wenting Zhao, a Cornell doctoral student and co-author of the paper, told TechCrunch. “Currently, even the best models can only generate hallucination-free text 35 percent of the time.”

There have been other academic attempts to probe the “factuality” of these models, including one by a separate team affiliated with AI2. But Zhao notes that those earlier tests asked the models questions whose answers are easily found on Wikipedia, which is not exactly the hardest challenge, considering that most of the models are trained on Wikipedia data.

To make their benchmark more challenging and to better reflect the types of questions people actually ask the models, the researchers identified topics around the web that have no Wikipedia reference. Just over half of the questions in their test cannot be answered using Wikipedia (they included a few Wikipedia-answerable ones for good measure), and they touch on topics including culture, geography, astronomy, pop culture, finance, medicine, computer science, and celebrities.

For their study, the researchers evaluated more than a dozen different popular models, many of which were released in the past year. In addition to GPT-4o, they tested “open” models like Meta’s Llama 3 70B, Mistral’s Mixtral 8x22B, and Cohere’s Command R+, as well as gated-behind-API models like Perplexity’s Sonar Large (which is based on Llama), Google’s Gemini 1.5 Pro, and Anthropic’s Claude 3 Opus.

The findings suggest that models aren’t hallucinating much less these days, despite claims to the contrary from OpenAI, Anthropic, and other major players in generative AI.

GPT-4o and OpenAI’s much older flagship model, GPT-3.5, performed nearly identically in terms of the percentage of benchmark questions answered correctly. (GPT-4o performed slightly better.) OpenAI’s models were the least hallucination-prone overall, followed by Mixtral 8x22B, Command R, and Perplexity’s Sonar models.

Questions about celebrities and finance gave the models the hardest time, while questions about geography and computer science were easier for them to answer (perhaps because their training data contained more references to these). In cases where the source of an answer was not Wikipedia, every model responded less factually on average (especially GPT-3.5 and GPT-4o), suggesting that they are all heavily informed by Wikipedia content.

Even models that can search the web for information, such as Cohere’s Command R and Perplexity’s Sonar models, struggled with “non-Wiki” queries in the benchmark. Model size didn’t matter much; smaller models (e.g., Anthropic’s Claude 3 Haiku) hallucinated about as often as larger, seemingly more capable models (e.g., Claude 3 Opus).

What does all this mean, and where are the improvements the vendors have promised?

Well, we wouldn’t rule out the possibility that the vendors are exaggerating their claims. But a more benign view is that the benchmarks they’re using aren’t fit for purpose. As we’ve written before, many, if not most, AI evaluations are fleeting and devoid of meaningful context, bound to fall victim to Goodhart’s Law.

In any case, Zhao says she expects the hallucination problem “to persist for a long time.”

“The empirical results in our paper indicate that, despite the promise of certain methods to reduce or eliminate hallucinations, the actual improvement that can be achieved with these methods is limited,” she said. “Furthermore, our analysis reveals that even knowledge found on the Internet can often be conflicting, in part because training data, created by humans, can also contain hallucinations.”

A temporary solution might be to simply program models to refuse to respond more often—the technical equivalent of telling a know-it-all to shut up.
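As a rough illustration of that approach (a minimal sketch, not the researchers’ method), abstention can be encouraged with nothing more than an instruction in the system prompt. The prompt wording and the build_messages helper below are hypothetical; any chat-style API accepts messages of this shape.

```python
# Sketch of the "refuse more often" idea: instruct the model to abstain when
# it is not confident. The prompt wording and build_messages() are illustrative
# assumptions, not taken from the study.

ABSTAIN_SYSTEM_PROMPT = (
    "Answer only if you are confident the answer is correct and verifiable. "
    "Otherwise reply exactly: I don't know."
)

def build_messages(question: str) -> list[dict]:
    """Pair the abstention instruction with the user's question."""
    return [
        {"role": "system", "content": ABSTAIN_SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]

if __name__ == "__main__":
    print(build_messages("What is the capital of Australia?"))
```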

In the researchers’ tests, Claude 3 Haiku answered only about 72 percent of the questions asked of it, choosing to abstain from the rest. When abstentions are taken into account, Claude 3 Haiku was actually the most factual model of all, at least in the sense that it lied the least often.
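To see why counting abstentions helps, here is a toy calculation: a refusal is not a falsehood, so a cautious model can end up with the best non-lying rate even though it answers fewer questions. Only the roughly 72 percent answer rate comes from the study; every other number below is hypothetical.

```python
# Toy factuality calculation where abstentions count as non-lies.
# Only the ~72% answer rate for Claude 3 Haiku comes from the article;
# all other figures are hypothetical.

def non_lying_rate(total: int, answered: int, correct: int) -> float:
    """Share of prompts that did NOT yield a false answer
    (abstentions are treated as harmless)."""
    wrong = answered - correct
    return (total - wrong) / total

TOTAL = 100
# Hypothetical bolder model: answers everything, gets 60 right.
print(non_lying_rate(TOTAL, answered=100, correct=60))  # 0.60
# Haiku-like model: answers 72 questions, hypothetically gets 65 right.
print(non_lying_rate(TOTAL, answered=72, correct=65))   # 0.93
```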

But will people use a model that declines to answer many questions? Zhao thinks not, and says vendors should focus more time and effort on research to reduce hallucinations. Eliminating hallucinations entirely may not be possible, but they can be mitigated through human fact-checking and citation during model development, she says.

“There is a need to develop policies and regulations to ensure that human experts are always involved in the verification and validation process of information generated by generative AI models,” Zhao added. “There are still many opportunities to make a significant impact in this field, such as developing advanced fact-checking tools for any free text, providing citations for factual content, and providing corrections for hallucinatory texts.”

Written by Anika Begay
