DeepSeek Says It's 2 Months Behind GPT-5. NIST Says 8. The Difference Is Which Tests You Run.

Two hundred and sixty-one Elo points. That is the measured distance between DeepSeek V4 Pro and the top American frontier model on NIST's pre-committed benchmark suite, released May 1, 2026. In practical terms, every 200-point Elo gap triples the odds of solving a given task, which means the leading US model has roughly four times the odds of cracking a randomly selected CAISI problem compared with China's best open-weight system running at maximum reasoning effort.
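The odds arithmetic can be checked in a few lines of Python. This is a minimal sketch that assumes the conventional Elo logistic scale, on which a 400-point gap corresponds to 10:1 odds; the article does not state which scale CAISI uses, so the constant and the function names here are illustrative assumptions rather than CAISI's method.

```python
# Assumption: conventional Elo scale, where 400 points = 10x odds.
ELO_SCALE = 400.0


def odds_ratio(elo_gap: float) -> float:
    """Multiplicative change in odds implied by an Elo gap."""
    return 10 ** (elo_gap / ELO_SCALE)


def stronger_success_prob(weaker_prob: float, elo_gap: float) -> float:
    """Success probability of the stronger model on a task that the
    weaker model solves with probability `weaker_prob`."""
    weaker_odds = weaker_prob / (1.0 - weaker_prob)
    stronger_odds = weaker_odds * odds_ratio(elo_gap)
    return stronger_odds / (1.0 + stronger_odds)


if __name__ == "__main__":
    for gap in (200, 261):
        print(f"gap={gap}  odds ratio={odds_ratio(gap):.2f}")
    # Hypothetical task that the weaker model solves half the time.
    print(f"stronger model: {stronger_success_prob(0.5, 261):.2f}")
```

Under that assumption a 200-point gap works out to about 3.2× odds and a 261-point gap to about 4.5×. Odds are not probabilities, though: on a task the weaker model solves half the time, a 261-point edge lifts the stronger model's success rate to about 82 percent, not to four times 50 percent.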

DeepSeek's own technical report paints a friendlier picture. It scored 95% versus the US model's 96% on PUMaC 2024, 91% versus 96% on GPQA-Diamond, and 79% versus 81% on SWE-Bench Verified. Gaps that small are consistent with Stanford's 2026 AI Index, which concluded that the US-China performance gap had narrowed to 2.7 percent on public benchmarks, and on those numbers policymakers could reasonably believe the race was nearly tied.

NIST disagrees, and the disagreement is not marginal. Its Center for AI Standards and Innovation (CAISI) evaluated DeepSeek V4 across nine benchmarks in five domains using a test suite selected before seeing any results, then aggregated the scores with Item Response Theory (IRT), the statistical framework used to score the SAT and GRE. Two of those benchmarks were held out specifically to resist training contamination: ARC-AGI-2, which used semi-private questions, and CAISI PortBench, a software engineering test built internally. On this suite, the gap between the US frontier and DeepSeek is not 2.7 percent but eight months of development time.

The Benchmark Inflation Ratio

DeepSeek claims a two-month lag on the benchmarks it selected for its technical report. NIST finds an eight-month lag on its pre-committed suite. That is a ratio of four to one: on this evidence, self-selected benchmarks understate the distance to the US frontier by a factor of four. This is the number that should be circulating in policy briefings, and as of this writing, it is not.

CAISI fitted an IRT model across 16 benchmarks and 35 models to produce Elo estimates: DeepSeek V4 Pro at maximum reasoning effort scored 999 ± 27, while the top US model reached 1,260 ± 28. At the lower "extra-high" reasoning setting, DeepSeek drops to 800 ± 28, widening the gap to 460 points, or roughly a 14× odds differential. The domain-level data reveals where the gap lives:

Domain	Benchmark	Top US Model	DeepSeek V4 (max)	Gap (pp)
Cyber	CTF-Archive-Diamond	71%	46%	25
Software Eng.	SWE-Bench Verified	81%	79%	2
Software Eng.	PortBench (held-out)	78%	60%	18
Natural Sci.	FrontierScience	79%	72%	7
Natural Sci.	GPQA-Diamond	96%	91%	5
Abstract	ARC-AGI-2 (semi-private)	79%	46%	33
Math	OTIS-AIME-2025	100%	92%	8
Math	PUMaC 2024	96%	95%	1
Math	SMT 2025	99%	94%	5

Math benchmarks cluster within single digits. Cyber operations and abstract reasoning blow open to 25 and 33 percentage points. ARC-AGI-2 is the starkest case: 79% versus 46%, a gap so wide it suggests fundamentally different capability levels rather than marginal training differences. On CAISI PortBench, where the test was built specifically to be impossible to train against, the gap was 18 points.
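The pattern in the table can be summarized with a short script. This is a rough sketch: the scores are copied from the table above, the held-out flag follows the article's description of PortBench and ARC-AGI-2, and the unweighted mean of percentage-point gaps is an illustrative summary chosen here, not a metric CAISI reports.

```python
from collections import defaultdict
from statistics import mean

# (domain, benchmark, top US score %, DeepSeek V4 max score %, held out?)
ROWS = [
    ("Cyber", "CTF-Archive-Diamond", 71, 46, False),
    ("Software Eng.", "SWE-Bench Verified", 81, 79, False),
    ("Software Eng.", "PortBench", 78, 60, True),
    ("Natural Sci.", "FrontierScience", 79, 72, False),
    ("Natural Sci.", "GPQA-Diamond", 96, 91, False),
    ("Abstract", "ARC-AGI-2", 79, 46, True),
    ("Math", "OTIS-AIME-2025", 100, 92, False),
    ("Math", "PUMaC 2024", 96, 95, False),
    ("Math", "SMT 2025", 99, 94, False),
]


def mean_gap_by(column: int) -> dict:
    """Unweighted mean gap (percentage points), grouped by a column index."""
    groups = defaultdict(list)
    for row in ROWS:
        groups[row[column]].append(row[2] - row[3])
    return {key: mean(gaps) for key, gaps in groups.items()}


if __name__ == "__main__":
    for domain, gap in mean_gap_by(0).items():
        print(f"{domain:<14} {gap:5.1f} pp")
    for held_out, gap in mean_gap_by(4).items():
        label = "held-out" if held_out else "public"
        print(f"{label:<14} {gap:5.1f} pp")
```

On these numbers the two held-out benchmarks average a 25.5-point gap, against roughly 7.6 points for the other seven.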

CAISI's Figure 3 makes the inflation ratio visible in a single chart. Panel (a) plots benchmarks from DeepSeek's technical report, where the model looks competitive; panel (b) plots the full CAISI suite, where it does not. The model and the hardware are the same in both panels, so benchmark selection alone produces dramatically different conclusions about the same underlying capability question.

Why Self-Selection Creates This Pattern

Benchmark self-selection is not fraud. It is incentive alignment operating exactly as designed. Every AI lab, Chinese or American, optimizes against the tests it publishes. Training data curation, reinforcement learning from human feedback, and architectural decisions all converge on known evaluation suites over thousands of training iterations, creating a gravitational pull toward high scores on public tests that does not transfer to tasks the model has never rehearsed.

CAISI's methodological contribution is specific: pre-committing to its benchmark suite before evaluating the model, and including held-out tests the model could not have trained against. Nobody had done this at this scale for a Chinese frontier model using a US government institution's resources and credibility. IRT itself dates to the 1960s and underpins virtually every standardized exam worldwide, which means the statistical framework is mature even if its application to AI evaluation is relatively new.
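For readers unfamiliar with IRT, the simplest variant is the one-parameter (Rasch) model, in which the probability of solving a task depends only on the difference between a model's ability and the task's difficulty. The article does not say which IRT variant CAISI fitted, so the following is an illustrative sketch only: the ability and difficulty values are hypothetical, and the 400-point conversion to Elo is an assumption.

```python
import math


def solve_probability(ability: float, difficulty: float) -> float:
    """Rasch (1PL) model: P(solve) = sigmoid(ability - difficulty), in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def odds(p: float) -> float:
    """Convert a probability to odds."""
    return p / (1.0 - p)


def logits_to_elo(logit_gap: float, scale: float = 400.0) -> float:
    """Assumed conversion: Elo points = logits * scale / ln(10)."""
    return logit_gap * scale / math.log(10)


if __name__ == "__main__":
    strong, weak = 1.5, 0.0  # hypothetical abilities, in logits
    for difficulty in (-1.0, 0.0, 2.0):  # hypothetical task difficulties
        p_strong = solve_probability(strong, difficulty)
        p_weak = solve_probability(weak, difficulty)
        ratio = odds(p_strong) / odds(p_weak)
        print(f"difficulty={difficulty:+.1f}  strong={p_strong:.2f}  "
              f"weak={p_weak:.2f}  odds ratio={ratio:.2f}")
    print(f"1.5 logits is about {logits_to_elo(1.5):.0f} Elo points")
```

In this model the odds ratio between two systems is the same at every difficulty level even though the success probabilities change, which is why a single Elo gap can be quoted as an odds multiplier for any given task.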

The Strongest Case That the Gap Does Not Matter

Eight months of capability lag means nothing if cost and accessibility matter more than raw performance. DeepSeek V4 costs $0.14 per million input tokens while GPT-5.5 costs $1.74, a 12× price differential that compounds across billions of inference calls. DeepSeek ships under an MIT license with open weights, runs on Huawei Ascend silicon that sidesteps US export controls entirely, and deploys a 1.6-trillion-parameter mixture-of-experts architecture with only 49 billion parameters active per token, making it radically cheaper to serve at scale than any comparable American system.
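To see how the price gap compounds, here is a back-of-the-envelope sketch using the two input-token prices quoted above. The workload (one billion calls at 2,000 input tokens each) is a hypothetical chosen for illustration, and output-token pricing is left out because the article does not give it.

```python
# Input-token list prices quoted in the article (USD per million tokens).
DEEPSEEK_USD_PER_M = 0.14
GPT_USD_PER_M = 1.74


def input_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost of sending `tokens` input tokens at a given per-million price."""
    return tokens / 1_000_000 * usd_per_million


PRICE_RATIO = GPT_USD_PER_M / DEEPSEEK_USD_PER_M

if __name__ == "__main__":
    # Hypothetical workload: 1 billion calls x 2,000 input tokens each.
    total_tokens = 1_000_000_000 * 2_000
    deepseek = input_cost_usd(total_tokens, DEEPSEEK_USD_PER_M)
    gpt = input_cost_usd(total_tokens, GPT_USD_PER_M)
    print(f"price ratio: {PRICE_RATIO:.1f}x")
    print(f"DeepSeek V4: ${deepseek:,.0f}")
    print(f"GPT-5.5:     ${gpt:,.0f}")
    print(f"difference:  ${gpt - deepseek:,.0f}")
```

At that hypothetical volume the input-token bill is about $280,000 on DeepSeek V4 versus about $3.48 million on GPT-5.5.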

CAISI's own cost analysis confirmed the disparity: DeepSeek V4 was cheaper than GPT-5.4 mini on five of seven benchmarks, with savings reaching 53 percent. Brookings reported in April that Chinese labs systematically use MoE architectures, quantization, and distillation to extract capability from constrained compute budgets, a strategy that prioritizes deployment velocity over benchmark supremacy.

If a Chinese enterprise can deploy a model scoring 91% on graduate-level science questions at one-twelfth the price of the American equivalent, the 33-point gap on abstract reasoning may be strategically irrelevant for most commercial applications. China's AI advantage may lie not in capability but in diffusion. Benchmark-driven analysis of the AI race, including this article, systematically underweights deployment speed because deployment is harder to measure than test scores.

What This Evaluation Cannot Determine

CAISI tested five domains (cyber operations, software engineering, natural sciences, abstract reasoning, and mathematics) but did not evaluate multimodal capabilities, long-context retrieval, or multilingual performance, areas where DeepSeek V4's one-million-token native context window could shift results substantially. IRT methodology aggregates performance across tasks, which means that on math specifically the gap might be weeks rather than months, while on cyber it could exceed a year. Cost comparisons used API pricing rather than self-hosted economics on Huawei Ascend chips, where Chinese enterprises face a fundamentally different cost structure that this evaluation does not capture.

"Eight months" is an average, and treating it as uniform across domains would be a misreading of CAISI's own data. Additionally, research from the autonomous coding community found that Kimi K2.6, a Chinese model matching Opus on benchmarks, underperformed by 28% in real-world autonomous software engineering tasks, a production-versus-benchmark gap that CAISI's evaluation does not address for any model, American or Chinese.

The Bottom Line

The policy world has been operating on Stanford AI Index numbers showing a 2.7% US-China gap, but NIST just published an 8-month gap using pre-committed, contamination-resistant methodology and the statistical rigor of standardized testing. Both numbers are correct because they measure different things: public benchmarks measure what labs want you to see, while pre-committed benchmarks measure what they would rather you did not.

If you set AI policy or advise legislators: stop citing public benchmark comparisons as evidence of capability parity. NIST's CAISI evaluation methodology, specifically benchmark pre-commitment and held-out test design, is the new floor for credible capability assessment. Demand it from any source claiming to measure the AI race.

If you build AI systems: test against benchmarks you did not train on. The gap between your self-reported numbers and independent evaluation is your benchmark inflation ratio, and knowing it honestly is the difference between strategic planning and self-deception.

If you evaluate AI vendors: request performance data on non-public benchmarks. A model that scores 95% on PUMaC but 46% on ARC-AGI-2 has a very different capability profile than its marketing material suggests.

Watch for CAISI's next evaluation cycle and whether other national standards bodies adopt the pre-commitment methodology, because the moment independent evaluation becomes routine, benchmark inflation loses its power as a narrative tool.

Sources

NIST CAISI (May 1, 2026). Evaluation of DeepSeek V4 Pro: IRT-estimated Elo 999 ± 27 vs. 1,260 ± 28 for top US model; 9 benchmarks, 5 domains, 16 total benchmarks fitted on 35 models. NIST CAISI Evaluation Report
Stanford HAI (2026). AI Index Report: US-China performance gap narrowed to 2.7% on public benchmarks; China leads in AI patents (69.7%). Stanford HAI AI Index 2026
Digital Trends (2026). DeepSeek V4: 1.6T parameters, 49B active per token, MIT-licensed, $0.14/M input tokens. Digital Trends DeepSeek V4 Coverage
Brookings Institution (April 2026). Competing AI strategies for the US and China: MoE, quantization, distillation strategies; Alibaba $53B, Microsoft $80B AI capex. Brookings Institution Analysis
Time to Build (2026). Benchmark Winners Aren't Production Winners: Kimi K2.6 underperforms benchmarks by 28% in autonomous work. Time to Build Production Analysis