🤖 AI & Computing

GPT-5.5 Tops Every AI Benchmark. It Also Hallucinates More Than Any Competitor. A Nature Paper Explains Why.

OpenAI's newest model scored 60 on the Artificial Analysis Intelligence Index, the highest ever recorded. On AA-Omniscience, the knowledge benchmark within that suite, it hallucinated at an 86% rate, producing 1.51 false-confident answers for every correct one. A Nature paper published two days before launch predicted exactly this: accuracy-only benchmarks reward bluffing.

[Image: abstract visualization of a glowing neural network confidently outputting streams of data, some golden and correct, others red and fabricated]

Eighty-six percent. That is the hallucination rate Artificial Analysis recorded for GPT-5.5 on its AA-Omniscience knowledge benchmark, the same evaluation where the model also posted 57% accuracy, the highest score any model has ever achieved on that test. The smartest model in the room is wrong, confidently, nearly nine times out of ten when it ventures beyond the boundaries of its training data.

Two days before OpenAI shipped GPT-5.5, a team led by Adam Tauman Kalai published a paper in Nature that explains why this outcome was not just possible but structurally inevitable. Their finding is clean and brutal: when you evaluate language models on accuracy alone, you create a selection pressure that rewards confident guessing over honest uncertainty. The models that score highest are not the ones that know the most. They are the ones most willing to bluff.

The Scoreboard

GPT-5.5 debuted on April 24, 2026, and immediately claimed the top spot on the Artificial Analysis v4.0 Intelligence Index with a score of 60. Claude Opus 4.7 and Gemini 3.1 Pro Preview sit tied at 57. The new score broke a three-way tie that had persisted since GPT-5.4 launched, and it marked the first time in months that an OpenAI model had clearly topped the index.

That score aggregates performance across ten evaluations: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, and CritPt. On Terminal-Bench 2.0, GPT-5.5 scored 82.7% against Claude Opus 4.7's 69.4% and Gemini 3.1 Pro's 68.5%. On GDPval, the model matched or beat human professionals in 84.9% of comparisons across 44 occupations. By the composite metrics the AI industry uses to crown its champions, GPT-5.5 is the best model ever built.

But Artificial Analysis also runs AA-Omniscience, a knowledge benchmark that measures not just whether a model gets the right answer but whether it fabricates one when it does not know. Here the numbers tell a different story entirely, one that no leaderboard reports.

| Model | Intelligence Index (v4.0) | AA-Omniscience Accuracy | Hallucination Rate | Confident-Wrong Ratio |
|---|---|---|---|---|
| GPT-5.5 (xhigh) | 60 | 57% | 86% | 1.51 |
| Claude Opus 4.7 (max) | 57 | ~57% | 36% | ~0.63 |
| Gemini 3.1 Pro Preview | 57 | ~57% | 50% | ~0.88 |

GPT-5.5 generates roughly 1.51 false-confident answers for every correct one, while Claude generates about 0.63, which means the model that wins the benchmark race produces nearly 2.4 times more confident fabrications per correct answer than its closest competitor. That ratio is the number that should be on the leaderboard, and it is nowhere to be found.
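The ratio follows directly from the two columns before it: hallucination rate divided by accuracy, the same definition used in the bottom line below. A minimal sketch reproducing the table's last column:

```python
# Confident-wrong ratio = hallucination rate / accuracy, using the figures quoted above.
table = {
    "GPT-5.5 (xhigh)":        (0.57, 0.86),
    "Claude Opus 4.7 (max)":  (0.57, 0.36),
    "Gemini 3.1 Pro Preview": (0.57, 0.50),
}

for model, (accuracy, hallucination_rate) in table.items():
    ratio = hallucination_rate / accuracy
    print(f"{model}: {ratio:.2f} confident-wrong answers per correct answer")

# GPT-5.5 (xhigh): 1.51 confident-wrong answers per correct answer
# Claude Opus 4.7 (max): 0.63 confident-wrong answers per correct answer
# Gemini 3.1 Pro Preview: 0.88 confident-wrong answers per correct answer
```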

Why the Tests Reward Bluffing

Kalai, Nachum, Vempala, and Zhang's Nature paper frames the problem as a structural defect in how the industry measures progress. Their argument has two components, and both matter.

First, next-word pretraining creates statistical pressure toward hallucination even with error-free training data. Facts that lack repeated support in the training corpus generate unavoidable errors, because the model learns to predict the most likely next token rather than to distinguish between what it knows and what it is guessing. The model has no epistemic category for "I am uncertain here." It has only a probability distribution over continuations, and some of those continuations are wrong in ways that look exactly like the right ones.
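A toy sketch, with invented probabilities, of why the decoding step cannot express that uncertainty: greedy next-token prediction emits the most likely continuation whether the distribution is sharply peaked or nearly flat, and the two outputs are indistinguishable to the reader.

```python
# Toy illustration with invented numbers: greedy decoding picks the argmax token
# whether the distribution reflects well-supported knowledge or a near-uniform guess.
well_supported = {"1969": 0.97, "1968": 0.02, "1970": 0.01}                                  # peaked
poorly_supported = {"1911": 0.22, "1912": 0.21, "1913": 0.20, "1909": 0.19, "1925": 0.18}    # nearly flat

for label, dist in [("well-supported fact", well_supported),
                    ("poorly-supported fact", poorly_supported)]:
    token, prob = max(dist.items(), key=lambda kv: kv[1])
    # Both answers are emitted with identical fluency; nothing here encodes
    # "I am uncertain", only a most-likely continuation.
    print(f"{label}: model asserts '{token}' (internal p = {prob:.2f})")
```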

Second, and more damaging: the dominant evaluation metrics reward this behavior. An accuracy-only benchmark asks "did you get it right?" and never asks "did you know you were guessing?" A model that attempts every question and gets 57% right will outscore a model that attempts only the questions it is confident about and gets 90% of those right, because the cautious model's abstentions count as zeros. The incentive is clear: guess everything, be wrong often, and be right enough to top the leaderboard.
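The arithmetic of that example, sketched with hypothetical question counts (the 1,000-question pool and the cautious model's 500 attempted questions are illustrative, not drawn from any benchmark):

```python
# Accuracy-only scoring: wrong answers and abstentions both count as zero,
# so attempting everything dominates answering only what you know.
TOTAL = 1_000  # hypothetical question pool

# Guessing model: attempts all questions, 57% correct.
guesser_correct, guesser_wrong = 570, 430
guesser_score = guesser_correct / TOTAL          # 0.57

# Cautious model: attempts only the 500 questions it is confident about, 90% correct.
cautious_correct, cautious_wrong, cautious_abstained = 450, 50, 500
cautious_score = cautious_correct / TOTAL        # abstentions score zero -> 0.45

print(f"guesser {guesser_score:.2f} beats cautious {cautious_score:.2f}")
print(f"...while producing {guesser_wrong} confident wrong answers "
      f"to the cautious model's {cautious_wrong}")
```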

The paper proposes a fix they call "open-rubric" evaluations: benchmarks that explicitly state error penalties, then test whether models modulate their abstention rates based on the stated stakes. A model that hallucinates at the same rate whether it is answering a trivia question or writing a medical diagnosis has failed a test that current leaderboards do not administer. Kalai and colleagues argue for adding hallucination-specific scoring variants to every existing benchmark, not as optional supplements but as co-equal metrics that sit alongside accuracy on the leaderboard.
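What such a scoring rule could look like is sketched below; the specific penalty value is an illustrative assumption, not a number from the paper, but it shows how a stated error cost flips the ranking from the example above.

```python
# Error-penalized scoring in the spirit of the paper's proposal: correct answers earn +1,
# wrong answers cost an explicitly stated penalty, abstentions score 0.
# The penalty of 1.0 is an illustrative choice, not a value from the paper.
def penalized_score(correct: int, wrong: int, abstained: int, penalty: float = 1.0) -> float:
    total = correct + wrong + abstained
    return (correct - penalty * wrong) / total

# Same hypothetical models as in the previous sketch (1,000 questions).
guesser = penalized_score(correct=570, wrong=430, abstained=0)    # (570 - 430) / 1000 = 0.14
cautious = penalized_score(correct=450, wrong=50, abstained=500)  # (450 - 50) / 1000 = 0.40

print(f"guesser {guesser:.2f} vs cautious {cautious:.2f}")
# Once the stakes are stated and enforced, the model that knows when to abstain wins,
# and a model that guesses at the same rate regardless of the penalty is exposed.
```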

The $67.4 Billion Bluff Tax

The financial cost of hallucination at industrial scale is not theoretical. It is enormous. A February 2026 report from Suprmind estimated that global business losses from AI hallucinations reached $67.4 billion in 2024. The number captures legal costs, rework, reputational damage, and failed automation projects across industries where companies deployed language models to generate customer-facing content, draft contracts, summarize medical records, or provide financial advice without adequate verification infrastructure. In effect, they built enterprise workflows on a foundation of models that, as the Kalai paper would later argue, were structurally incentivized to guess.

The details are worse than the headline: on legal questions, the best models still hallucinate at 18.7%; on medical queries, at 15.6%; and on basic summarization, where the source text is right there in the prompt, even the best models fabricate at 0.7% or higher, a rate that seems small until you consider how many summaries a large enterprise generates per day. Forty-seven percent of business executives in the Suprmind survey admitted to making major decisions based on unverified AI content.

MIT researchers found in January 2025 that models use 34% more confident language when hallucinating than when stating verified facts, a finding that maps precisely onto the Kalai paper's theoretical prediction: training on accuracy metrics creates models that are not just wrong but assertively wrong, deploying stronger language precisely when they are on the weakest ground. The confident-wrong correlation is not a bug in the training pipeline; it is the objective function working as designed.

The Price of Being Smartest

OpenAI priced GPT-5.5 at $5 per million input tokens and $30 per million output tokens, roughly double GPT-5.4. A 40% reduction in output tokens per response partially offsets the hike, leaving a net cost increase of approximately 20% over the previous generation. The company's pitch is that superior benchmark performance justifies the premium, and for the use cases where GPT-5.5 genuinely excels, like Terminal-Bench 2.0 (82.7%), that argument holds.
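A back-of-envelope version of that calculation, assuming GPT-5.4 was priced at exactly half the new rates and an output-heavy workload (both illustrative assumptions, not published figures):

```python
# Illustrative cost comparison; GPT-5.5 prices are the announced rates,
# the GPT-5.4 prices and the workload mix are assumptions for this sketch.
old_in, old_out = 2.50, 15.00    # $ per million tokens, assumed (half the new rates)
new_in, new_out = 5.00, 30.00    # $ per million tokens, announced

in_mtok, old_out_mtok = 1.0, 10.0    # assumed output-heavy workload, millions of tokens
new_out_mtok = old_out_mtok * 0.6    # GPT-5.5 emits ~40% fewer output tokens per response

old_cost = in_mtok * old_in + old_out_mtok * old_out   # $152.50
new_cost = in_mtok * new_in + new_out_mtok * new_out   # $185.00

print(f"net increase: {new_cost / old_cost - 1:.0%}")  # ~21% on this assumed workload
```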

But there is a quieter number in the Artificial Analysis data that complicates the value proposition: GPT-5.5 at the "medium" compute tier matches Claude Opus 4.7 at "max" compute for roughly one-quarter the cost, approximately $1,200 versus $4,800 per comparable workload. Enterprise buyers optimizing for cost-performance are likely to run GPT-5.5 at medium, which means they get benchmark-equivalent intelligence to Claude at 75% savings but at a hallucination rate that Artificial Analysis has not published for the medium tier. If the hallucination rate scales with the accuracy-benchmark optimization pressure that Kalai describes, cheaper inference might also mean more fabrication, and nobody is measuring that tradeoff because the leaderboards do not require it.

Claude Opus 4.7 outscored GPT-5.5 on SWE-Bench Pro, 64.3% to 58.6%, a benchmark that tests real-world software engineering rather than knowledge retrieval. This is consistent with the Kalai framework: coding benchmarks have built-in error penalties because code either compiles and passes tests or it does not. Bluffing does not work when the evaluator is a compiler. Where evaluations punish fabrication, the gap between GPT-5.5 and its competitors narrows or reverses, but where they do not, GPT-5.5 wins convincingly.

The Strongest Case for the Leaderboard

AA-Omniscience specifically targets edge-case knowledge that few users encounter in daily workflows. An 86% hallucination rate on obscure knowledge questions does not mean GPT-5.5 hallucinates at 86% on typical enterprise tasks like summarization, translation, or code generation. The overall Intelligence Index, which aggregates ten diverse benchmarks spanning coding, reasoning, instruction following, and real-world tasks, may be a better predictor of actual value delivered to a paying customer than a single knowledge benchmark designed to probe the boundaries of what models know.

Defenders of the current evaluation paradigm would also note that GPT-5.5's GDPval score, matching human professionals in 84.9% of comparisons, measures something closer to real-world utility than knowledge retrieval does. If a model performs as well as a human accountant, lawyer, or analyst in a controlled comparison, the fact that it hallucinates on trivia questions outside those domains is arguably irrelevant to the enterprise buyer deploying it within scope.

This is a strong argument, and it is incomplete. The 47% of executives in the Suprmind survey who made decisions on unverified AI output were not deliberately asking their models obscure trivia questions. They were using the models for exactly the kind of professional tasks where GDPval says GPT-5.5 excels, and the hallucinations that cost their companies money came from the gap between "usually right" and "always right," a gap that accuracy-only benchmarks are structurally designed not to measure.

What This Analysis Does Not Prove

The AA-Omniscience hallucination rate is specific to that benchmark's methodology and its focus on edge-case knowledge. We cannot directly verify how Artificial Analysis defines and measures "hallucination" because their full methodology is proprietary. The Kalai Nature paper uses theoretical learning theory to demonstrate structural incentives; it does not empirically measure the behavior of GPT-5.5 or any specific commercial model. The Suprmind $67.4 billion figure aggregates self-reported losses across industries and has not been independently audited. We do not have hallucination rates for GPT-5.5's medium or low compute tiers, and our suggestion that cheaper inference might correlate with higher hallucination is speculative, extrapolated from the paper's theoretical framework rather than observed data. GPT-5.5 launched days ago, and real-world hallucination patterns across diverse enterprise deployments may differ substantially from benchmark conditions.

The Bottom Line

If you are evaluating AI models for enterprise deployment, stop reading the Intelligence Index in isolation and start asking three questions: what is the model's hallucination rate on the specific task I need, what is the confident-wrong ratio (hallucinations divided by accuracy), and does the vendor publish these numbers at all? The Kalai paper gives you the theoretical framework to understand why a model can be the smartest and the least trustworthy simultaneously: accuracy-only benchmarks create a selection pressure that rewards bluffing. GPT-5.5 is the clearest empirical confirmation of that prediction to date, with a confident-wrong ratio of 1.51 against Claude's 0.63, a 2.4x gap that no composite index captures.

If you build AI evaluations, read the Kalai paper and implement their open-rubric proposal: add explicit error penalties, test whether your model changes its behavior when the stakes change, and publish hallucination rates alongside accuracy scores. And if you run an AI leaderboard, put the confident-wrong ratio on the front page next to the composite score, because right now the leaderboard tells you which model is smartest and says nothing about which one is the most willing to lie about what it does not know.

Sources

  1. Kalai, A. T., Nachum, O., Vempala, S. S., & Zhang, E. (April 22, 2026). Evaluating large language models for accuracy incentivizes hallucinations. Nature. doi:10.1038/s41586-026-10549-w
  2. Artificial Analysis (April 24, 2026). GPT-5.5 Intelligence Index score of 60, AA-Omniscience accuracy and hallucination rates. Via OfficeChai
  3. OpenAI (April 24, 2026). GPT-5.5 announcement: Terminal-Bench 2.0 (82.7%), GDPval (84.9%), pricing at $5/$30 per million tokens. Via Decrypt, MacRumors
  4. Suprmind (February 2026). AI Hallucination Report: $67.4 billion in global business losses (2024), 18.7% legal hallucination rate, 15.6% medical hallucination rate, 47% of executives acted on unverified AI output. Suprmind
  5. MIT News (January 10, 2025). Large language models use 34% more confident language when hallucinating. MIT News