The Fastest Deflation Curve in History: AI Inference Costs Are Dropping 10x Per Year
In March 2023, GPT-4-class output cost $60 per million tokens. By July 2024, GPT-4o mini delivered equivalent benchmark scores for $0.60. That is 100x deflation in 16 months, and it dwarfs every technology cost curve on record.
One hundred to one.
That is the ratio between what OpenAI charged for GPT-4 output tokens at launch in March 2023 and what it charged for GPT-4o mini output tokens 16 months later. The first cost $60 per million output tokens. The second cost $0.60. On the MMLU benchmark, a standard test of broad knowledge across 57 subjects, GPT-4 scored 86.4%. GPT-4o mini scored 82.0%. On HumanEval, a code generation benchmark, the cheaper model actually won: 87.2% versus 67%.
A model 100 times cheaper. Roughly equivalent on knowledge benchmarks. Better at writing code. Released barely a year after the model it was designed to replace.
This is not normal. Every technology has a cost curve. Silicon chips follow Moore's Law. Solar panels follow Swanson's Law. Batteries follow Wright's Law. AI inference is now following something steeper than all of them, and it is not close.
The Numbers, Plotted
Constructing a proper deflation curve requires apples-to-apples comparisons. Below, I track the cost of obtaining GPT-4-equivalent output quality through publicly listed API prices, using MMLU and HumanEval as rough capability anchors.
| Model | Date | Output $/M tokens | MMLU | HumanEval |
|---|---|---|---|---|
| GPT-4 (8K) | Mar 2023 | $60.00 | 86.4% | 67.0% |
| GPT-4 Turbo | Nov 2023 | $30.00 | 86.4% | ~67% |
| GPT-4o | May 2024 | $15.00 | ~87% | ~90% |
| GPT-4o mini | Jul 2024 | $0.60 | 82.0% | 87.2% |
| GPT-5 nano | 2026 | $0.40 | >82% | >87% |
From $60.00 to $0.60 in 16 months. Annualized, that is roughly a 97% decline per year in the cost of GPT-4-equivalent output. Even if you penalize the comparison for GPT-4o mini's 4.4-point MMLU gap and call the effective deflation only 50x rather than 100x, the annualized rate is still close to 95%.
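The annualization is two lines of arithmetic; here it is as a Python sanity check, using only the prices and dates from the table above:

```python
# Annualize the GPT-4 -> GPT-4o mini price drop: $60 to $0.60 over 16 months.
start_price, end_price, months = 60.00, 0.60, 16

annual_factor = (start_price / end_price) ** (12 / months)  # ~31.6x cheaper per year
annual_decline = 1 - 1 / annual_factor                      # ~96.8% decline per year
print(f"{annual_factor:.1f}x per year ({annual_decline:.1%} annual decline)")
```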
By early 2026, GPT-5 nano costs $0.40 per million output tokens and outperforms GPT-4 on every published benchmark. The frontier has moved to GPT-5.2 at $14 per million tokens, delivering capabilities that did not exist at any price two years ago.
The Four Curves, Compared
Every transformative technology has a characteristic cost-decline rate. Below are the annualized deflation rates for four of the most important technology curves in history, calculated from public data.
| Technology | Cost Metric | Annual Decline | Time to 10x Cheaper |
|---|---|---|---|
| Transistors (Moore's Law) | $/transistor | ~30% | ~7 years |
| Solar PV (Swanson's Law) | $/watt | ~10% | ~22 years |
| Li-ion Batteries (Wright's Law) | $/kWh | ~15% | ~14 years |
| AI Inference | $/M tokens (equiv. quality) | ~90% | ~1 year |
Moore's Law needs roughly seven years to deliver a 10x cost reduction. AI inference does it in twelve months. The gap is not 2x or 3x. On a log-price chart, the AI inference curve falls roughly six to seven times as steeply as Moore's Law, nearly an order of magnitude steeper.
The data sources: McKinsey's analysis puts DRAM cost decline at 30-35% per year over several decades. Swanson's Law averages roughly 20% per doubling of installed capacity, translating to about 10% annually. Wright's Law for batteries shows a 19% decline per doubling of cumulative production, historically about 15% per year. AI inference blows past all three.
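The "time to 10x cheaper" column is a direct function of the annual decline rate: solve (1 - d)^t = 0.1 for t. A short sketch of the conversion, using the rates quoted above:

```python
from math import log

def years_to_10x(annual_decline: float) -> float:
    # Solve (1 - d)^t = 0.1 for t: years until costs fall 10x at decline rate d.
    return log(0.1) / log(1 - annual_decline)

for name, d in [("Transistors", 0.30), ("Solar PV", 0.10),
                ("Li-ion batteries", 0.15), ("AI inference", 0.90)]:
    print(f"{name:17s} {years_to_10x(d):5.1f} years")
# ~6.5, ~21.9, ~14.2, and exactly 1.0 years -- matching the table.
```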
The Quadruple Learning Curve
Why is AI inference deflating this fast? The short answer: four independent vectors are compounding simultaneously. No prior technology had more than two.
Vector 1: Hardware. NVIDIA's A100 (2020) to H100 (2022) to B200 (2024) progression delivers roughly 2-3x inference throughput per chip generation. Each generation arrives on an 18-24 month cadence. Foundry improvements (TSMC's 4nm to 3nm) contribute another 15-20% per node. GPUnex data confirms cost-per-FLOP dropping steadily.
Vector 2: Algorithmic efficiency. The levers here include mixture-of-experts architectures (GPT-4's rumored ~1.8 trillion parameters with only ~280 billion active per token), quantization-aware training (FP16 to FP8 to INT4), speculative decoding (2-3x speedup for structured outputs), and FlashAttention (reducing memory-bandwidth bottlenecks). Each technique independently delivers 1.5-3x improvements. Combined, they compound to 2-3x per year.
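To make one of these levers concrete, consider quantization, which shrinks the memory and memory bandwidth a model consumes per token. A back-of-envelope sketch; the 70-billion-parameter size is illustrative, not any specific product:

```python
# Weight memory at different numeric precisions. Fewer bytes per parameter
# means fewer GPUs per model replica and more tokens served per watt.
PARAMS = 70e9  # illustrative model size (an assumption, not a real spec)

for fmt, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    weight_gb = PARAMS * bytes_per_param / 1e9
    print(f"{fmt}: {weight_gb:.0f} GB of weights")  # 140, 70, 35 GB
```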
Vector 3: Competition. In 2023, OpenAI was essentially the sole provider of GPT-4-class inference. By 2026, Anthropic (Claude Haiku at $5/M output), Google (Gemini Flash), Meta (open-weight Llama models, free to self-host), Mistral, and DeepSeek all offer competitive models. Price competition alone accounts for roughly 1.5-2x annual deflation from margin compression.
Vector 4: Distillation. This is the one that has no analog in physical technology. A $79 million GPT-4 training run produces a "teacher" model. That teacher then generates synthetic training data for a smaller "student" model at negligible marginal cost. DeepSeek R1 was trained for $294,000 using efficiency optimizations and distillation from frontier models. The student model cost 0.4% of what GPT-4 cost to train and competes with it on many benchmarks. This is like photocopying a factory. You cannot photocopy a solar panel factory. You cannot photocopy a chip fab.
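No lab publishes its distillation pipeline, so any code here is necessarily generic. The compact textbook form of the idea is logit matching (Hinton et al., 2015), in which the student is trained to reproduce the teacher's softened output distribution; a minimal PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Soften both distributions with a temperature, then train the student
    # to match the teacher via KL divergence. The T^2 factor keeps gradient
    # magnitudes comparable across temperatures.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_student, soft_teacher,
                    reduction="batchmean") * temperature ** 2
```

Frontier-scale distillation more often runs on teacher-generated synthetic data than on raw logits, but the economics are identical: knowledge paid for once in the teacher is copied into a cheaper student.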
Stack the vectors: 2.5x (hardware) times 2.5x (algorithms) times 1.5x (competition) times 3x (distillation) = approximately 28x per year. This rough multiplication aligns with the observed 10-100x annual deflation, depending on how strictly you define "equivalent quality."
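Spelled out, with the caveat that these multipliers are this essay's rough estimates rather than measured values:

```python
from math import prod

# The four vectors' rough per-year multipliers (the essay's own estimates).
vectors = {"hardware": 2.5, "algorithms": 2.5, "competition": 1.5, "distillation": 3.0}

stack = prod(vectors.values())   # ~28x cheaper per year
decline = 1 - 1 / stack          # ~96% annual price decline
print(f"stacked deflation: {stack:.1f}x/year ({decline:.1%} decline)")
```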
DeepSeek and the Proof of Compounding
The most striking validation of the quadruple curve is DeepSeek R1. This Chinese lab trained a model for $294,000 in 2025 that competes with GPT-4 on reasoning benchmarks. For context, GPT-4's training run cost an estimated $79 million. That is a 269x cost reduction in training, achieved not through better hardware alone (DeepSeek used H800 GPUs, export-restricted H100 variants with reduced interconnect bandwidth), but through aggressive algorithmic optimization and distillation techniques.
If training costs can drop 269x in two years, inference costs can drop even faster, because inference benefits from all the same techniques plus serving-specific ones (continuous request batching, KV-cache reuse, post-training quantization) that have no training-time analog.
The Electricity Floor
Every deflation curve eventually hits a physical floor. For transistors, it was quantum tunneling at sub-5nm gate lengths. For solar, it is the Shockley-Queisser limit on single-junction cell efficiency. For AI inference, the floor is electricity.
A single H100 GPU draws about 700 watts at peak load. At U.S. industrial electricity rates of roughly $0.07/kWh, that is about $0.05 per hour of compute. Current GPT-4o mini pricing ($0.60/M output tokens) is only sustainable if each GPU serves on the order of 12 million output tokens per hour. At that throughput, the electricity cost alone for one million tokens works out to roughly $0.004. That leaves a ~150x gap between electricity cost and the current API price, meaning algorithmic and competitive deflation have significant room to continue compressing the margin.
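The floor arithmetic, reproduced with every assumption explicit (the throughput figure is the one assumed above, not a measured number):

```python
# Back-of-envelope electricity floor for serving a small model on one H100.
GPU_WATTS = 700               # H100 peak power draw
PRICE_PER_KWH = 0.07          # assumed U.S. industrial electricity rate, $/kWh
TOKENS_PER_GPU_HOUR = 12e6    # assumed serving throughput (from the text)
API_PRICE_PER_M = 0.60        # GPT-4o mini output price, $/M tokens

electricity_per_hour = GPU_WATTS / 1000 * PRICE_PER_KWH        # ~$0.049/hour
electricity_per_m = electricity_per_hour / (TOKENS_PER_GPU_HOUR / 1e6)
print(f"electricity per 1M tokens: ${electricity_per_m:.4f}")  # ~$0.0041
print(f"gap to API price: {API_PRICE_PER_M / electricity_per_m:.0f}x")  # ~147x
```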
The electricity floor for AI inference sits somewhere around $0.001-0.01 per million tokens for GPT-4-class output, depending on future hardware efficiency. At that price, generating a 1,000-word article would cost a fraction of a cent. We are perhaps 2-3 years away at the current deflation rate.
The Strongest Case Against This Analysis
The most serious objection: this comparison is partially rigged. GPT-4o mini does not match GPT-4 on every task. Its MMLU score is 4.4 points lower. On complex multi-step reasoning, the gap widens further. Calling it "equivalent" overstates the deflation.
This is fair. But even correcting aggressively for quality differences, the cost-per-unit-of-capability (measured as benchmark performance per dollar) is declining at 70-80% annually. That is still more than double Moore's Law at its peak.
A second objection: much of the "deflation" is competitive margin compression, not true cost reduction. OpenAI's margins on GPT-4 at $60/M were enormous. The price drop to GPT-4o mini at $0.60/M reflects both cheaper inference and thinner margins. True engineering cost deflation is slower than price deflation.
Also fair. But margin compression is itself a feature, not a bug. End users do not care whether their costs dropped because of better hardware or because Anthropic forced a price war. The economic impact is the same. And even OpenAI's cost of goods sold, estimated by analysts at roughly $0.15-0.30 per million tokens for GPT-4o mini-class models, represents a steep decline from the estimated $8-12 per million tokens it cost to serve GPT-4 at launch.
A third objection: three years of data is too short to declare a "law." Solar and battery cost curves rest on 40+ years of data. GPT-4-class inference pricing has existed for barely 36 months. The current rate could be a one-time correction (the initial move from general-purpose training to distilled, optimized inference) rather than an ongoing trend.
This is the strongest objection. The honest answer is: we do not know if the ~10x/year rate is sustainable. It could slow to 3-5x/year as low-hanging algorithmic fruit is picked and electricity costs become the binding constraint. Even at 3x/year, AI inference would still deflate faster than any other technology in history.
What This Means
Limitations first. This analysis relies on published API prices, not internal cost-of-goods-sold data. The MMLU and HumanEval benchmarks are imperfect proxies for general capability. Three years of pricing data is insufficient to establish a durable trend with high confidence. The four-vector framework is a conceptual model, not a physics equation, and the individual multipliers are approximate.
The Bottom Line
We have lived through Moore's Law for six decades. It delivered a 30% annual decline in the cost of a transistor and reshaped civilization. AI inference costs are declining at three times that rate, compounding across four independent vectors simultaneously. Whether the current 10x/year pace holds or moderates to 3-5x, the trajectory is clear: tasks that cost dollars today will cost pennies within two years and fractions of a penny within four. The economics of intelligence are changing faster than the economics of any technology we have measured. The only question is whether our institutions can adapt at the same speed. History says they cannot.
Sources
- GPT-4 vs GPT-4o Mini: performance and pricing comparison, including MMLU (86.4% vs 82.0%), HumanEval (67% vs 87.2%), and pricing ($60/M vs $0.60/M output tokens) (DocsBot)
- ChatGPT API pricing 2026: full token cost history from GPT-3.5 Turbo through GPT-5.2, including GPT-5 nano at $0.40/M output tokens (IntuitionLabs, updated March 2026)
- AI training costs 2026: GPT-4 at $79M, DeepSeek R1 at $294K, cost-per-FLOP declining ~10x/year (GPUnex, February 2026)
- Moore's Law: transistor doubling every 24 months, 30-35% annual cost decline per transistor, >99% decline in quality-adjusted computer price index since 1970s (Wikipedia, sourcing Intel data and U.S. BLS)
- McKinsey semiconductor analysis: DRAM per-bit prices declined 30-35% annually for decades (McKinsey & Company)
- Swanson's Law: solar PV costs decline 20% per doubling of capacity, ~10% per year historical average (AEI / The Economist)
- Wright's Law for batteries: 19% cost decline per doubling of cumulative production, ~15% annual decline historically (Our World in Data)
- Anthropic Claude API pricing: Haiku 4.5 at $5/M output, Sonnet 4.6 at $15/M, Opus 4.6 at $25/M (IntuitionLabs / Anthropic documentation)
- AI cost deflation overview: costs dropping from $10/M tokens to $2.50/M over 2024-2025, 75% decrease (Ramp)