The Fastest Deflation Curve in History: AI Inference Costs Are Dropping 10x Per Year
In March 2023, GPT-4-class output cost $60 per million tokens. By July 2024, GPT-4o mini delivered equivalent benchmark scores for $0.60. That is 100x deflation in 16 months, and it dwarfs every technology cost curve on record.
One hundred to one.
That is the ratio between what OpenAI charged for GPT-4 output tokens at launch in March 2023 and what it charged for GPT-4o mini output tokens 16 months later. The first cost $60 per million output tokens. The second cost $0.60. On the MMLU benchmark, a standard test of broad knowledge across 57 subjects, GPT-4 scored 86.4%. GPT-4o mini scored 82.0%. On HumanEval, a code generation benchmark, the cheaper model actually won: 87.2% versus 67%.
A model 100 times cheaper. Roughly equivalent on knowledge benchmarks. Better at writing code. Released barely a year after the model it was designed to replace.
This is not normal. Every technology has a cost curve. Silicon chips follow Moore's Law. Solar panels follow Swanson's Law. Batteries follow Wright's Law. AI inference is now following something steeper than all of them, and it is not close.
The Numbers, Plotted
Constructing a proper deflation curve requires apples-to-apples comparisons. Below, I track the cost of obtaining GPT-4-equivalent output quality through publicly listed API prices, using MMLU and HumanEval as rough capability anchors.
| Model | Date | Output $/M tokens | MMLU | HumanEval |
|---|---|---|---|---|
| GPT-4 (8K) | Mar 2023 | $60.00 | 86.4% | 67.0% |
| GPT-4 Turbo | Nov 2023 | $30.00 | 86.4% | ~67% |
| GPT-4o | May 2024 | $15.00 | ~87% | ~90% |
| GPT-4o mini | Jul 2024 | $0.60 | 82.0% | 87.2% |
| GPT-5 nano | 2026 | $0.40 | >82% | >87% |
From $60.00 to $0.60 in 16 months. Annualized, that is roughly a 97% decline per year in the cost of GPT-4-equivalent output. Even if you penalize the comparison for GPT-4o mini's 4.4-point MMLU gap and call the effective deflation only 50x rather than 100x, the annualized rate is still close to 95%.
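The annualization is two lines of arithmetic; here it is as a Python sanity check, using only the prices and dates from the table above:

```python
# Annualize the GPT-4 -> GPT-4o mini price drop: $60 to $0.60 over 16 months.
start_price, end_price, months = 60.00, 0.60, 16

annual_factor = (start_price / end_price) ** (12 / months)  # ~31.6x cheaper per year
annual_decline = 1 - 1 / annual_factor                      # ~96.8% decline per year
print(f"{annual_factor:.1f}x per year ({annual_decline:.1%} annual decline)")
```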
By early 2026, GPT-5 nano costs $0.40 per million output tokens and outperforms GPT-4 on every published benchmark. The frontier has moved to GPT-5.2 at $14 per million tokens, delivering capabilities that did not exist at any price two years ago.
The Four Curves, Compared
Every transformative technology has a characteristic cost-decline rate. Below are the annualized deflation rates for four of the most important technology curves in history, calculated from public data.
| Technology | Cost Metric | Annual Decline | Time to 10x Cheaper |
|---|---|---|---|
| Transistors (Moore's Law) | $/transistor | ~30% | ~7 years |
| Solar PV (Swanson's Law) | $/watt | ~10% | ~22 years |
| Li-ion Batteries (Wright's Law) | $/kWh | ~15% | ~14 years |
| AI Inference | $/M tokens (equiv. quality) | ~90% | ~1 year |
Moore's Law needs roughly seven years to deliver a 10x cost reduction. AI inference does it in twelve months. The gap is not 2x or 3x. On a log-price chart, the AI inference curve falls roughly six to seven times as steeply as Moore's Law, nearly an order of magnitude steeper.
The data sources: McKinsey's analysis puts DRAM cost decline at 30-35% per year over several decades. Swanson's Law averages roughly 20% per doubling of installed capacity, translating to about 10% annually. Wright's Law for batteries shows a 19% decline per doubling of cumulative production, historically about 15% per year. AI inference blows past all three.
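The "time to 10x cheaper" column is a direct function of the annual decline rate: solve (1 - d)^t = 0.1 for t. A short sketch of the conversion, using the rates quoted above:

```python
from math import log

def years_to_10x(annual_decline: float) -> float:
    # Solve (1 - d)^t = 0.1 for t: years until costs fall 10x at decline rate d.
    return log(0.1) / log(1 - annual_decline)

for name, d in [("Transistors", 0.30), ("Solar PV", 0.10),
                ("Li-ion batteries", 0.15), ("AI inference", 0.90)]:
    print(f"{name:17s} {years_to_10x(d):5.1f} years")
# ~6.5, ~21.9, ~14.2, and exactly 1.0 years -- matching the table.
```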
The Quadruple Learning Curve
Why is AI inference deflating this fast? The short answer: four independent vectors are compounding simultaneously. No prior technology had more than two.
Vector 1: Hardware. NVIDIA's A100 (2020) to H100 (2022) to B200 (2024) progression delivers roughly 2-3x inference throughput per chip generation. Each generation arrives on an 18-24 month cadence. Foundry improvements (TSMC's 4nm to 3nm) contribute another 15-20% per node. GPUnex data confirms cost-per-FLOP dropping steadily.
Vector 2: Algorithmic efficiency. The levers here include mixture-of-experts architectures (GPT-4's rumored ~1.8 trillion parameters with only ~280 billion active per token), quantization-aware training (FP16 to FP8 to INT4), speculative decoding (2-3x speedup for structured outputs), and FlashAttention (reducing memory-bandwidth bottlenecks). Each technique independently delivers 1.5-3x improvements. Combined, they compound to 2-3x per year.
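To make one of these levers concrete, consider quantization, which shrinks the memory and memory bandwidth a model consumes per token. A back-of-envelope sketch; the 70-billion-parameter size is illustrative, not any specific product:

```python
# Weight memory at different numeric precisions. Fewer bytes per parameter
# means fewer GPUs per model replica and more tokens served per watt.
PARAMS = 70e9  # illustrative model size (an assumption, not a real spec)

for fmt, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    weight_gb = PARAMS * bytes_per_param / 1e9
    print(f"{fmt}: {weight_gb:.0f} GB of weights")  # 140, 70, 35 GB
```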
Vector 3: Competition. In 2023, OpenAI was essentially the sole provider of GPT-4-class inference. By 2026, Anthropic (Claude Haiku at $5/M output), Google (Gemini Flash), Meta (open-weight Llama models, free to self-host), Mistral, and DeepSeek all offer competitive models. Price competition alone accounts for roughly 1.5-2x annual deflation from margin compression.
Vector 4: Distillation. This is the one that has no analog in physical technology. A $79 million GPT-4 training run produces a "teacher" model. That teacher then generates synthetic training data for a smaller "student" model at negligible marginal cost. DeepSeek R1 was trained for $294,000 using efficiency optimizations and distillation from frontier models. The student model cost 0.4% of what GPT-4 cost to train and competes with it on many benchmarks. This is like photocopying a factory. You cannot photocopy a solar panel factory. You cannot photocopy a chip fab.
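No lab publishes its distillation pipeline, so any code here is necessarily generic. The compact textbook form of the idea is logit matching (Hinton et al., 2015), in which the student is trained to reproduce the teacher's softened output distribution; a minimal PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Soften both distributions with a temperature, then train the student
    # to match the teacher via KL divergence. The T^2 factor keeps gradient
    # magnitudes comparable across temperatures.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_student, soft_teacher,
                    reduction="batchmean") * temperature ** 2
```

Frontier-scale distillation more often runs on teacher-generated synthetic data than on raw logits, but the economics are identical: knowledge paid for once in the teacher is copied into a cheaper student.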
Stack the vectors: 2.5x (hardware) times 2.5x (algorithms) times 1.5x (competition) times 3x (distillation) = approximately 28x per year. This rough multiplication aligns with the observed 10-100x annual deflation, depending on how strictly you define "equivalent quality."
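Spelled out, with the caveat that these multipliers are this essay's rough estimates rather than measured values:

```python
from math import prod

# The four vectors' rough per-year multipliers (the essay's own estimates).
vectors = {"hardware": 2.5, "algorithms": 2.5, "competition": 1.5, "distillation": 3.0}

stack = prod(vectors.values())   # ~28x cheaper per year
decline = 1 - 1 / stack          # ~96% annual price decline
print(f"stacked deflation: {stack:.1f}x/year ({decline:.1%} decline)")
```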
DeepSeek and the Proof of Compounding
The most striking validation of the quadruple curve is DeepSeek R1. This Chinese lab trained a model for $294,000 in 2025 that competes with GPT-4 on reasoning benchmarks. For context, GPT-4's training run cost an estimated $79 million. That is a 269x cost reduction in training, achieved not through better hardware alone (DeepSeek used H800 GPUs, export-restricted H100 variants with reduced interconnect bandwidth), but through aggressive algorithmic optimization and distillation techniques.
If training costs can drop 269x in two years, inference costs can drop even faster, because inference benefits from all the same techniques plus serving-specific ones (continuous request batching, KV-cache reuse, post-training quantization) that have no training-time analog.
The Electricity Floor
Every deflation curve eventually hits a physical floor. For transistors, it was quantum tunneling at sub-5nm gate lengths. For solar, it is the Shockley-Queisser limit on single-junction cell efficiency. For AI inference, the floor is electricity.
A single H100 GPU draws about 700 watts at peak load. At U.S. industrial electricity rates of roughly $0.07/kWh, that is about $0.05 per hour of compute. Current GPT-4o mini pricing ($0.60/M output tokens) is only sustainable if each GPU serves on the order of 12 million output tokens per hour. At that throughput, the electricity cost alone for one million tokens works out to roughly $0.004. That leaves a ~150x gap between electricity cost and the current API price, meaning algorithmic and competitive deflation have significant room to continue compressing the margin.
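The floor arithmetic, reproduced with every assumption explicit (the throughput figure is the one assumed above, not a measured number):

```python
# Back-of-envelope electricity floor for serving a small model on one H100.
GPU_WATTS = 700               # H100 peak power draw
PRICE_PER_KWH = 0.07          # assumed U.S. industrial electricity rate, $/kWh
TOKENS_PER_GPU_HOUR = 12e6    # assumed serving throughput (from the text)
API_PRICE_PER_M = 0.60        # GPT-4o mini output price, $/M tokens

electricity_per_hour = GPU_WATTS / 1000 * PRICE_PER_KWH        # ~$0.049/hour
electricity_per_m = electricity_per_hour / (TOKENS_PER_GPU_HOUR / 1e6)
print(f"electricity per 1M tokens: ${electricity_per_m:.4f}")  # ~$0.0041
print(f"gap to API price: {API_PRICE_PER_M / electricity_per_m:.0f}x")  # ~147x
```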
The electricity floor for AI inference sits somewhere around $0.001-0.01 per million tokens for GPT-4-class output, depending on future hardware efficiency. At that price, generating a 1,000-word article would cost a fraction of a cent. We are perhaps 2-3 years away at the current deflation rate.
The Strongest Case Against This Analysis
The most serious objection: this comparison is partially rigged. GPT-4o mini does not match GPT-4 on every task. Its MMLU score is 4.4 points lower. On complex multi-step reasoning, the gap widens further. Calling it "equivalent" overstates the deflation.
This is fair. But even correcting aggressively for quality differences, the cost-per-unit-of-capability (measured as benchmark performance per dollar) is declining at 70-80% annually. That is still more than double Moore's Law at its peak.
A second objection: much of the "deflation" is competitive margin compression, not true cost reduction. OpenAI's margins on GPT-4 at $60/M were enormous. The price drop to GPT-4o mini at $0.60/M reflects both cheaper inference and thinner margins. True engineering cost deflation is slower than price deflation.
Also fair. But margin compression is itself a feature, not a bug. End users do not care whether their costs dropped because of better hardware or because Anthropic forced a price war. The economic impact is the same. And even OpenAI's cost of goods sold, estimated by analysts at roughly $0.15-0.30 per million tokens for GPT-4o mini-class models, represents a steep decline from the estimated $8-12 per million tokens it cost to serve GPT-4 at launch.
A third objection: three years of data is too short to declare a "law." Solar and battery cost curves rest on 40+ years of data. GPT-4-class inference pricing has existed for barely 36 months. The current rate could be a one-time correction (the initial move from general-purpose training to distilled, optimized inference) rather than an ongoing trend.
This is the strongest objection. The honest answer is: we do not know if the ~10x/year rate is sustainable. It could slow to 3-5x/year as low-hanging algorithmic fruit is picked and electricity costs become the binding constraint. Even at 3x/year, AI inference would still deflate faster than any other technology in history.
What This Means
Limitations first. This analysis relies on published API prices, not internal cost-of-goods-sold data. The MMLU and HumanEval benchmarks are imperfect proxies for general capability. Three years of pricing data is insufficient to establish a durable trend with high confidence. The four-vector framework is a conceptual model, not a physics equation, and the individual multipliers are approximate.
The Bottom Line
We have lived through Moore's Law for six decades. It delivered a 30% annual decline in the cost of a transistor and reshaped civilization. AI inference costs are declining at three times that rate, compounding across four independent vectors simultaneously. Whether the current 10x/year pace holds or moderates to 3-5x, the trajectory is clear: tasks that cost dollars today will cost pennies within two years and fractions of a penny within four. The economics of intelligence are changing faster than the economics of any technology we have measured. The only question is whether our institutions can adapt at the same speed. History says they cannot.
Sources
- GPT-4 vs GPT-4o Mini: performance and pricing comparison, including MMLU (86.4% vs 82.0%), HumanEval (67% vs 87.2%), and pricing ($60/M vs $0.60/M output tokens) (DocsBot)
- ChatGPT API pricing 2026: full token cost history from GPT-3.5 Turbo through GPT-5.2, including GPT-5 nano at $0.40/M output tokens (IntuitionLabs, updated March 2026)
- AI training costs 2026: GPT-4 at $79M, DeepSeek R1 at $294K, cost-per-FLOP declining ~10x/year (GPUnex, February 2026)
- Moore's Law: transistor doubling every 24 months, 30-35% annual cost decline per transistor, >99% decline in quality-adjusted computer price index since 1970s (Wikipedia, sourcing Intel data and U.S. BLS)
- McKinsey semiconductor analysis: DRAM per-bit prices declined 30-35% annually for decades (McKinsey & Company)
- Swanson's Law: solar PV costs decline 20% per doubling of capacity, ~10% per year historical average (AEI / The Economist)
- Wright's Law for batteries: 19% cost decline per doubling of cumulative production, ~15% annual decline historically (Our World in Data)
- Anthropic Claude API pricing: Haiku 4.5 at $5/M output, Sonnet 4.6 at $15/M, Opus 4.6 at $25/M (IntuitionLabs / Anthropic documentation)
- AI cost deflation overview: costs dropping from $10/M tokens to $2.50/M over 2024-2025, 75% decrease (Ramp)