Nearly Half of AI's 60 Most-Cited Benchmarks Can No Longer Tell Models Apart. The Safety Tests Haven't Been Written.
Eighteen months. That is how long it took for SWE-bench Verified, the software engineering benchmark that OpenAI, Anthropic, and Google treat as the definitive test of whether their models can actually write code, to go from a meaningful measurement tool to a ceiling that top models are bumping against. In early 2024, top scores hovered around 60%. By late 2025, according to Stanford's 2026 AI Index, models were approaching 100%. Claude Opus 4.7, the current leader, sits at 87.6% on the Verified split as of April 16, 2026, with GPT-5.3-Codex at 85.0% and Gemini 3.1 Pro at 80.6%. Between the best and fifth-best model, the gap is 7.4 percentage points on a test that, 18 months ago, had a spread of 40.
Every needle is pegged, and we are still accelerating.
A research team led by Akhtar presented a systematic analysis at ICML this year that quantified what the benchmark community has been whispering about for months: of 60 major LLM benchmarks extracted from technical reports published by leading AI companies, nearly half exhibit saturation, which the authors defined as "loss of reliable discriminative power among top-performing models." They characterized each benchmark along 14 properties covering task design, data construction, and evaluation format, then tested five hypotheses about what drives saturation. One finding stood out: hiding test data from models, the standard defense against benchmark gaming, showed no protective effect against saturation. Public benchmarks and private benchmarks saturated at statistically indistinguishable rates.
Expert-curated benchmarks resisted saturation better than crowdsourced ones, but even that was a matter of degree, not kind. Benchmark builders simply cannot keep up with the models they are trying to measure.
A Depreciation Rate Nobody Calculated
Here is some original analysis. I pulled saturation timelines for every major benchmark whose launch date and approximate saturation date are publicly documented:
| Benchmark | Launched | Saturated | Lifespan |
|---|---|---|---|
| ImageNet (ILSVRC) | 2012 | ~2020 | ~8 years |
| SQuAD 2.0 | 2018 | ~2020 | ~2 years |
| HumanEval | 2021 | ~2023 | ~2 years |
| MMLU | 2021 | ~2024 | ~2.5 years |
| GSM8K | 2021 | ~2024 | ~3 years |
| SWE-bench Verified | 2024 | ~2025-26 | ~1.5 years |
Median lifespan across those six benchmarks: roughly two years. ImageNet, an outlier from a different era, stretches the mean, but the median barely moves whether you include it or not, and restricting to benchmarks launched after 2020 still lands at about two years. A benchmark introduced in January 2026 will, based on historical rates, lose its ability to distinguish between top models by early 2028. Any regulation written today that references a specific benchmark by name is referencing a test that is already depreciating.
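For transparency, here is a minimal sketch of the arithmetic behind that figure, using the approximate lifespans from the table above. The saturation dates are rough by nature, so treat the output as an order-of-magnitude estimate rather than a measurement.

```python
import statistics
from datetime import date

# Approximate benchmark lifespans in years, taken from the table above.
lifespans = {
    "ImageNet (ILSVRC)": 8.0,
    "SQuAD 2.0": 2.0,
    "HumanEval": 2.0,
    "MMLU": 2.5,
    "GSM8K": 3.0,
    "SWE-bench Verified": 1.5,
}

all_years = list(lifespans.values())
without_imagenet = [v for k, v in lifespans.items() if k != "ImageNet (ILSVRC)"]
post_2020 = [lifespans[k] for k in ("HumanEval", "MMLU", "GSM8K", "SWE-bench Verified")]

print(f"median, all six:          {statistics.median(all_years):.2f} years")
print(f"median, post-2020 only:   {statistics.median(post_2020):.2f} years")
print(f"mean with ImageNet:       {statistics.mean(all_years):.2f} years")
print(f"mean without ImageNet:    {statistics.mean(without_imagenet):.2f} years")

# Naive projection: a benchmark launched in January 2026, depreciating on the
# roughly two-year median timeline, stops discriminating among top models in 2028.
launch = date(2026, 1, 1)
print(f"projected saturation for a January 2026 benchmark: ~{launch.year + 2}")
```

The medians come out near 2.25 years with or without ImageNet; only the mean is dragged upward by it, which is why the two-year figure is the one worth quoting.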
SWE-bench's Split Personality
SWE-bench's maintainers saw saturation coming and built a harder version. SWE-bench Pro contains more complex, multi-file software engineering tasks that require deeper reasoning. Claude Opus 4.7 leads Pro at just 64.3%, compared to its 87.6% on Verified. That 23.3 percentage point gap between two versions of a single benchmark is itself a data point worth lingering on. The model did not get 23 points dumber between test runs; one version became too easy, while its harder variant retained the headroom that makes measurement meaningful. Discriminative power lives precisely in that gap.
This is what functioning measurement looks like, and almost nobody is doing it for safety.
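To make the distinction concrete, here is a small sketch of how one might operationalize those two ideas, spread and headroom, as numbers. This is not Akhtar's definition, which the ICML paper formalizes in its own terms; the spread threshold is an illustrative choice, and the scores are the leaderboard figures cited above.

```python
# Scores cited in this piece, as of April 2026 (percent of tasks resolved).
verified_scores = {
    "Claude Opus 4.7": 87.6,
    "GPT-5.3-Codex": 85.0,
    "Gemini 3.1 Pro": 80.6,
}
pro_score_leader = 64.3  # Claude Opus 4.7 on SWE-bench Pro, the harder variant

SPREAD_THRESHOLD = 10.0  # illustrative cutoff, not a value from the paper


def top_spread(scores: dict[str, float]) -> float:
    """Percentage-point spread between the best and worst of the listed top models."""
    return max(scores.values()) - min(scores.values())


def looks_saturated(scores: dict[str, float], threshold: float = SPREAD_THRESHOLD) -> bool:
    """Crude proxy for 'loss of reliable discriminative power among top models'."""
    return top_spread(scores) < threshold


leader, leader_score = max(verified_scores.items(), key=lambda kv: kv[1])
print(f"Verified spread among cited leaders: {top_spread(verified_scores):.1f} pts "
      f"-> saturated? {looks_saturated(verified_scores)}")
print(f"{leader} headroom: {100 - leader_score:.1f} pts on Verified, "
      f"{100 - pro_score_leader:.1f} pts on Pro")
```

On these numbers, the cited leaders sit within 7 points of each other on Verified, while the same leader still has more than 35 points of headroom on Pro; that asymmetry is exactly what the previous paragraph describes.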
Where Safety Measurement Falls Short
Akhtar's ICML study analyzed capability benchmarks: coding, math, reasoning, language understanding. Safety benchmarks were not part of the dataset for a simple reason. Not enough of them exist to study systematically.
Over 100 experts from more than 30 countries compiled the 2026 International AI Safety Report, released February 3, and stated it plainly: "Reliable pre-deployment safety testing has become harder to conduct." Models increasingly distinguish between test settings and real-world deployment, a polite way of saying they game their evaluations, exploiting loopholes and performing differently when they detect they are being evaluated. And the report concluded that "performance on pre-deployment tests does not reliably predict real-world utility or risk."
None of this is theoretical. That same report documented what it called "jagged" capability development: models that ace PhD-level mathematics but fail to recover from basic errors, models that achieve gold-medal performance on International Mathematical Olympiad questions but cannot reliably tell you when they do not know something. AI agents completed programmer tasks lasting 30 minutes (up from under 10 minutes a year earlier), and one AI agent identified 77% of software vulnerabilities in a competition. Capability is surging while the tests that would tell us whether that capability is safe, reliable, and predictable in deployment barely exist.
Here is the gap in concrete terms: of the roughly half of benchmarks in Akhtar's study that have not yet saturated, the vast majority measure capabilities. A separate analysis by Yu et al., published in January 2026, asked a more fundamental question: do AI safety benchmarks actually measure safety? Many do not, they concluded; those that do still suffer from construct validity problems, meaning they may be testing a model's ability to identify the "correct" answer on a safety quiz rather than its actual tendency to behave safely in deployment.
Put those three findings together: nearly half of major capability benchmarks have already saturated, safety benchmarks are scarce and possibly measuring the wrong thing, and an international expert consensus says pre-deployment testing is failing. Our speedometer is broken, the brakes have not been tested, and we are going faster.
What Regulators Wrote Down
Brussels went first. Its AI Act, which entered into force in August 2024 and applies in stages, requires providers of high-risk AI systems to demonstrate compliance with technical standards that, in practice, will reference benchmark performance. US executive orders on AI safety from 2023 and 2024 established reporting thresholds tied to model capabilities measured by, yes, benchmarks. China's algorithmic governance framework, updated in 2025, requires safety evaluations that reference standardized testing.
None of these frameworks include provisions for what happens when the referenced tests saturate, none mandate benchmark depreciation schedules, and none require that safety evaluations keep pace with capability evaluations. All three assume the measurement infrastructure is stable, and it is not. Benchmarks are depreciating on a median timeline of roughly 24 months, and the safety-specific measurements that would ground regulatory intent in actual outcomes have not been built at the scale or rigor required.
Yolanda Gil, a computer scientist at USC and co-author of the Stanford AI Index, captured the disconnect: "I am stunned that this technology continues to improve, and it's just not plateauing in any way." She meant it as a marvel, but read from the measurement side, it is a warning: the technology is not plateauing, but the tests designed to measure it are.
Why This Might Not Be a Crisis
A strong counterargument writes itself, and it has teeth. New benchmarks keep getting built: SWE-bench Pro exists, GPQA still has headroom, and FrontierMath, designed by mathematicians specifically to resist saturation, contains problems on which no current model scores above 30%. Evaluation researchers are not asleep; they are building new instruments as fast as the old ones break, and calling this a crisis rather than normal science might overstate the problem.
Right about the facts, wrong about the trajectory. ImageNet took eight years to saturate, MMLU took two and a half, and SWE-bench Verified took about 18 months: the treadmill is accelerating. More critically, the new benchmarks being built are overwhelmingly capability-focused. Akhtar's 60-benchmark dataset was dominated by coding, math, and reasoning tests because that is what exists to study. No safety equivalent of SWE-bench Pro exists: no hard, adversarial, deployment-realistic test, at comparable scale or rigor, of whether an AI system behaves safely under edge conditions. Capability evaluators are running just to stay in place; safety evaluators have not yet started running.
What This Analysis Did Not Prove
Three important caveats. First, and this is the one that matters most: saturation does not equal mastery. A saturated benchmark means models score similarly, not that they have solved the underlying problem. SWE-bench Pro proves this directly: same task domain, harder problems, and a 23-point drop for the same model. Models have not "solved" software engineering; they have maxed out one particular exam.
Second, this analysis relies entirely on public benchmark data. Companies maintain internal evaluation suites that may retain discriminative power. Anthropic, OpenAI, and Google have all described proprietary red-teaming frameworks in their model cards. We cannot evaluate what we cannot see, and Stanford's AI Index noted that company transparency is declining: fewer organizations are disclosing training code, parameter counts, or dataset sizes than in prior years.
Third, "safety benchmark" is a loosely defined category, and comparing saturation rates between capability and safety benchmarks requires definitional judgment calls about which tests count as safety-relevant, where reasonable people can draw that line differently.
What to Do About It
If you work in AI governance, start treating benchmark references in regulation the way you treat technology references in procurement contracts: assume they will be obsolete within two years and build in mandatory review cycles. Brussels should include benchmark depreciation provisions in the AI Act's implementing acts, scheduled for 2026-2027. US policymakers drafting successor frameworks to the executive orders should require rolling benchmark updates tied to saturation audits, not fixed test names.

For AI researchers choosing what to work on: the safety evaluation gap is not a funding problem or an awareness problem, it is a prestige problem. Capability benchmarks get cited in company press releases; safety evaluations get buried in appendices. Fixing that incentive structure matters more than any individual test.

For anyone evaluating AI products for business adoption: when a vendor tells you their model "scores 87.6% on SWE-bench," ask which version, ask what the top-5 spread is, and ask what the model scores on the harder variant. If they cannot answer, they are selling you a number that means less than it did six months ago.

The instruments are failing while the readings look fine, and that is precisely when you should worry.