
Companies Will Spend $725 Billion on AI Computing This Year. They're Using 5% of It.

Enterprise GPU clusters across Kubernetes environments run at 5% utilization, according to Cast AI's analysis of tens of thousands of clusters. Hyperscalers have committed $725 billion to AI infrastructure in 2026, up 77% year over year. At 5% utilization, every useful GPU-hour costs 20 times its sticker price, and simple scheduling fixes could recover more than $150 billion in idle capacity without buying a single new chip.

Image: A vast server room stretching into the distance, with most racks dark and powered down while a small cluster glows intensely in the foreground.

Five percent. That is the average GPU utilization rate across enterprise Kubernetes clusters in 2026, according to Cast AI's third annual State of Kubernetes Optimization Report, which drew its data from tens of thousands of non-optimized clusters running on AWS, Azure, and Google Cloud Platform, covering workloads that range from inference endpoints and fine-tuning jobs to developer sandboxes and internal AI tooling. CPU utilization dropped to 8%, down from 10% a year earlier, while memory fell from 23% to 20%, and GPUs, the single most expensive silicon in any modern data center, sat idle nineteen hours out of every twenty.

Meanwhile, the four largest hyperscalers committed more money to AI infrastructure in 2026 than any industry has ever committed to anything in a single calendar year. Microsoft, Alphabet, Amazon, and Meta will collectively spend $725 billion on AI compute this year, a 77% increase over 2025, and in Q1 alone their combined capital expenditures hit $130 billion, which works out to more than a billion dollars per day flowing into data center construction, cooling systems, power infrastructure, and GPU procurement across three continents.

Nobody disputes those numbers. And nobody disputes the utilization figure. What nobody has done is divide one by the other and sit with what comes out.

Twenty Dollars for a Dollar's Worth of Work

An NVIDIA H100 GPU rents for roughly $3.93 per hour on-demand through major cloud providers, according to aggregated pricing data from getdeploying.com and IntuitionLabs. At 5% utilization, the effective cost per useful GPU-hour becomes $3.93 divided by 0.05: $78.60 for one hour of actual computation, delivered on hardware that was provisioned, powered, cooled, and network-connected for the other nineteen hours it spent doing absolutely nothing.

NVIDIA's own Multi-Instance GPU (MIG) technology, which partitions a single physical GPU into isolated instances running independent workloads simultaneously, routinely pushes utilization to 40-70% in production voice AI and inference pipelines. At 40% utilization, the effective cost drops to $9.83 per useful GPU-hour; at 70%, it falls to $5.61. That is an 8x to 14x cost reduction from the current enterprise average, achieved entirely through software configuration on hardware already installed in existing racks.
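The arithmetic behind those figures is a single division, and it is easy to rerun with your own rates. The sketch below is illustrative only: the function name effective_cost is ours, the $3.93 rate is the on-demand figure quoted above, and Decimal is used so that the rounding matches the dollar amounts in the text.

```python
from decimal import Decimal, ROUND_HALF_UP

def effective_cost(hourly_price: Decimal, utilization: Decimal) -> Decimal:
    """Cost of one *useful* GPU-hour when only `utilization` of paid time does work."""
    if not Decimal(0) < utilization <= Decimal(1):
        raise ValueError("utilization must be in the range (0, 1]")
    cost = hourly_price / utilization
    # Round to whole cents, half up, the way a price would be quoted.
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

H100_HOURLY = Decimal("3.93")  # on-demand rate cited in the article

baseline = effective_cost(H100_HOURLY, Decimal("0.05"))
for utilization in (Decimal("0.05"), Decimal("0.40"), Decimal("0.70")):
    cost = effective_cost(H100_HOURLY, utilization)
    print(f"{utilization:.0%} utilization: ${cost} per useful GPU-hour "
          f"({baseline / cost:.1f}x cheaper than at 5%)")
```

Because the hourly price cancels out, the improvement factor is simply the ratio of the two utilization rates: 8x at 40% and 14x at 70%.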

WinBuzzer projects $401 billion in enterprise AI infrastructure spending for 2026. If GPU-related costs represent 40-50% of that total, a conservative range given that GPU servers account for the fastest-growing procurement category in enterprise IT, then companies will spend between $160 billion and $200 billion specifically on GPU capacity this year, and at 5% utilization, between $152 billion and $190 billion of that investment will sit idle at any given moment, drawing power, generating heat, and producing nothing.

Conservative estimate: over $150 billion in GPU capacity will go unused in 2026, not because companies bought the wrong hardware, but because they never configured the software to share it.
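For readers who want to stress-test that estimate, here is a minimal sketch of the calculation. The inputs are the article's own: the $401 billion projection, the assumed 40-50% GPU share, and the 5% utilization average. The function name idle_gpu_spend is ours, and the GPU share in particular is an assumption, not a measured value.

```python
def idle_gpu_spend(total_spend_bn: float, gpu_share: float, utilization: float) -> float:
    """Billions of dollars of GPU capacity sitting idle at a given utilization."""
    gpu_spend_bn = total_spend_bn * gpu_share
    return gpu_spend_bn * (1.0 - utilization)

TOTAL_SPEND_BN = 401.0   # projected 2026 enterprise AI infrastructure spend
UTILIZATION = 0.05       # average GPU utilization reported by Cast AI

# Assumed GPU share of total spend; try lower values to see how the estimate moves.
for gpu_share in (0.40, 0.50):
    idle = idle_gpu_spend(TOTAL_SPEND_BN, gpu_share, UTILIZATION)
    print(f"GPU share {gpu_share:.0%}: about ${idle:.0f} billion idle")
```

Plugging in a lower GPU share shrinks the dollar figure in direct proportion, while the utilization rate itself is unaffected.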

Why the Waste Is Concentrated Where You Think

One distinction matters more than any other when interpreting these numbers: what happens inside enterprise GPU clusters versus what happens inside hyperscaler training environments. When Meta trains Llama or Google trains Gemini, those purpose-built clusters run at sustained near-maximum utilization for weeks or months, because saturation is the entire point of a training run, and every idle GPU-second pushes back the date the model ships. Hyperscaler GPU training efficiency is likely well above 70%, though none of the four majors publish exact figures.

Enterprise clusters are a different animal entirely. Companies provisioning GPU capacity for inference workloads, internal AI tooling, fine-tuning experiments, and developer sandboxes consistently overprovision by enormous margins because Kubernetes historically treats GPUs as indivisible resources: one pod gets one whole GPU, even if the workload running inside that pod uses 3% of its capacity, and the remaining 97% sits locked, allocated, and utterly idle. Cast AI found CPU overprovisioning at 69% and memory overprovisioning at 79%, and GPU overprovisioning is structurally worse because of that integer allocation constraint.

Solutions exist. Kubernetes 1.34 brought Dynamic Resource Allocation (DRA) to general availability, enabling fractional GPU scheduling natively for the first time, according to CIO.com. Cast AI shipped DRA support in May 2026. HAMi, an open-source GPU virtualization project, claims up to 90% utilization through its abstraction layer. Adoption, not invention, is the bottleneck.
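To see why the integer allocation constraint is so costly, consider a toy comparison between whole-GPU allocation and slice-based sharing. Everything in this sketch is hypothetical: the eight workload demands are invented, the seven-slice limit is borrowed from MIG's maximum instance count, and the first-fit packing ignores real placement rules and memory limits.

```python
import math

SLICES_PER_GPU = 7  # assumption: each GPU can be split into seven equal slices

def gpus_needed_whole(demands):
    """One pod, one whole GPU: the count is just the number of workloads."""
    return len(demands)

def gpus_needed_sliced(demands, slices_per_gpu=SLICES_PER_GPU):
    """First-fit-decreasing packing of workloads into GPU slices."""
    if any(not 0 < d <= 1 for d in demands):
        raise ValueError("each demand must be a fraction of one GPU in (0, 1]")
    # Round every workload up to a whole number of slices, largest first.
    needed = sorted((math.ceil(d * slices_per_gpu) for d in demands), reverse=True)
    free = []  # free slices remaining on each GPU opened so far
    for slices in needed:
        for i, available in enumerate(free):
            if available >= slices:
                free[i] -= slices
                break
        else:
            free.append(slices_per_gpu - slices)  # open a new GPU
    return len(free)

# Hypothetical workloads, each expressed as the fraction of one GPU it really uses.
demands = [0.03, 0.10, 0.25, 0.05, 0.40, 0.03, 0.08, 0.06]

whole = gpus_needed_whole(demands)
sliced = gpus_needed_sliced(demands)
print(f"whole-GPU allocation: {whole} GPUs at {sum(demands) / whole:.1%} utilization")
print(f"sliced allocation:    {sliced} GPUs at {sum(demands) / sliced:.1%} utilization")
```

With these made-up numbers the same eight workloads fit on two GPUs instead of eight. Real clusters will not pack this cleanly, but the direction of the effect is the point.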

Scarcity and Waste, Simultaneously

While enterprises leave purchased GPUs idle, the semiconductor supply chain cannot manufacture them fast enough, creating a paradox that would be comical if the dollar amounts involved were not staggering. A CNAS report identifies HBM and DRAM shortages as critical barriers that could consume 30% of AI spending in 2026 purely on memory components that sit upstream of the GPUs themselves. TSMC's 3nm fabrication capacity is fully booked through 2027. Silicon Motion's CEO told investors that foundry-level chip shortages will persist until at least 2028. By that point the next two GPU generations will have been announced, and the procurement cycle will have produced new silicon that enterprises will also buy, also fail to configure, and also run at single-digit utilization.

H100 prices have dropped 44% since 2025, partly because Blackwell B200 availability pushed older chips toward value-tier pricing, but AWS simultaneously raised H200 prices by 15% in 2026, reflecting genuine demand pressure at the performance frontier. Enterprises face both dynamics at once: premium prices for new hardware and abysmal returns on hardware already deployed.

Strongest Counterargument

Enterprise GPU clusters are intentionally overprovisioned for burst capacity, and that framing deserves to be taken seriously because inference workloads are genuinely spiky: an AI endpoint that serves 200 requests per second at 2 AM and 15,000 requests per second at market open has to be provisioned for the peak, not the mean, and the cost of a 15-second response time during peak load is measured in lost customers and breached SLAs, not in wasted GPU-hours during the quiet periods that made the peak survivable. Fire stations sit empty between calls. That is the design, not the failure.

Fair enough, and it matters. But 5% is not "overprovisioned for bursts." Standard enterprise server practice targets 40-70% utilization in environments with significant burst requirements, and data center operators consider anything below 30% a red flag that triggers formal review. MIG partitioning, time-slicing, and DRA scheduling exist precisely to maintain burst headroom while running background workloads during the 95% of the time that nothing latency-sensitive is happening. Fire stations sit empty between calls, but the trucks are not left idling in the parking lot with every light flashing and every siren cycling, consuming fuel and maintenance cycles for absolutely no operational benefit, which is what 5% utilization functionally amounts to when the hardware draws power whether it computes or not.

What This Analysis Did Not Prove

Cast AI's sample consists exclusively of non-optimized Kubernetes clusters, which means it selects for organizations that have not yet implemented GPU sharing or scheduling improvements, and the 5% average therefore reflects the bottom of the distribution rather than a representative median across all enterprise GPU deployments. Some well-managed clusters likely run at 30-40%, and many others probably sit closer to zero. GPU utilization metrics themselves are imperfect: a GPU reported as idle may be waiting on memory transfers, network I/O, or preprocessing steps that bottleneck the pipeline upstream of the compute stage, and low utilization does not always mean the silicon could have been doing something else.

Our $150 billion idle capacity estimate assumes GPU-related costs constitute 40-50% of total enterprise AI infrastructure spending, and if the actual GPU share is lower because power delivery, networking equipment, and physical construction dominate the bill, the dollar figure for GPU waste shrinks proportionally even though the utilization problem persists. Finally, the $725 billion hyperscaler capex figure includes buildings, land, electrical substations, and cooling towers alongside the GPUs themselves, and the 5% utilization figure applies to enterprise Kubernetes clusters specifically, not to the hyperscaler training environments where most of that $725 billion is flowing.

What You Can Do

If you manage AI infrastructure, run an audit this week: execute kubectl top nodes for CPU and memory and nvidia-smi for GPU utilization across your fleet, calculate the gap between provisioned GPU-seconds and actually consumed GPU-seconds over a 7-day window, and if utilization sits below 20%, you are leaving money on the table that requires no budget approval, no procurement cycle, and no hardware changes to recover.
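One way to turn that audit into numbers is to sample nvidia-smi at a fixed interval and add up the results. The sketch below is a starting point, not a monitoring tool: the function names and the sampling interval are ours, it covers a single node, and it assumes every GPU reports a numeric utilization.gpu value, which some virtualized setups do not.

```python
import subprocess

# Standard nvidia-smi query: one CSV line per GPU, e.g. "0, 5".
QUERY = ["nvidia-smi", "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"]

def parse_sample(text):
    """Turn nvidia-smi CSV output into {gpu_index: utilization_percent}."""
    sample = {}
    for line in text.strip().splitlines():
        index, util = (field.strip() for field in line.split(","))
        sample[int(index)] = float(util)
    return sample

def read_sample():
    """Take one live reading on the current node (requires nvidia-smi on PATH)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_sample(result.stdout)

def gpu_seconds(samples, interval_s):
    """Return (provisioned, consumed) GPU-seconds for evenly spaced samples."""
    gpu_ids = {gpu for sample in samples for gpu in sample}
    provisioned = len(gpu_ids) * len(samples) * interval_s
    consumed = sum(util / 100.0 * interval_s
                   for sample in samples for util in sample.values())
    return provisioned, consumed

# Two hypothetical one-minute readings from a two-GPU node.
samples = [parse_sample("0, 5\n1, 0\n"), parse_sample("0, 15\n1, 0\n")]
provisioned, consumed = gpu_seconds(samples, interval_s=60)
print(f"utilization: {consumed / provisioned:.1%} "
      f"({provisioned - consumed:.0f} idle GPU-seconds)")
```

In practice you would call read_sample() on a schedule for the full seven days and feed the collected readings to gpu_seconds(). Keep in mind that utilization.gpu reports how much of each sample period had a kernel running, so it shares the measurement caveats discussed earlier.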

Start with MIG partitioning on any NVIDIA A100 or H100 running inference workloads, because NVIDIA's engineering documentation walks through the configuration in under an hour and production benchmarks show reliable workload isolation between partitioned instances. For mixed training-and-inference environments, evaluate Kubernetes DRA scheduling, which became production-ready in version 1.34 and enables fractional GPU allocation without custom device plugins or vendor-specific orchestration layers.
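Once a GPU is partitioned, a workload asks for a slice rather than a whole device. As a rough sketch, the snippet below generates a pod manifest requesting one MIG slice. It assumes the NVIDIA device plugin is running with its "mixed" MIG strategy, under which slices are advertised with resource names such as nvidia.com/mig-1g.10gb; the pod name, image, and chosen profile are placeholders, so check what your nodes actually advertise before using it.

```python
import json

def mig_pod_manifest(name, image, mig_profile="1g.10gb"):
    """Build a pod spec that requests one MIG slice instead of a whole GPU.

    Assumes the NVIDIA device plugin exposes slices as nvidia.com/mig-<profile>
    (its "mixed" strategy); the default profile here is only an example.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": name,
                "image": image,
                "resources": {
                    "limits": {f"nvidia.com/mig-{mig_profile}": 1},
                },
            }],
        },
    }

# Hypothetical pod name and image, for illustration only.
manifest = mig_pod_manifest("inference-demo", "registry.example.com/inference:latest")
print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON as well as YAML, so the printed manifest can be saved to a file and applied with kubectl apply -f once the names match your environment.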

If you are a CTO negotiating cloud contracts, demand utilization-based pricing tied to actual GPU-hours consumed rather than GPU-hours reserved, because cloud providers have no financial incentive to solve this problem for you: their revenue increases when your utilization decreases, since you pay for capacity provisioned around peak demand twenty-four hours a day regardless of whether anything runs during the other twenty-three.

If you are an investor evaluating AI infrastructure companies, ask one question on every earnings call: what is your average GPU utilization rate? Any company spending more than $10 million annually on GPU compute that cannot answer that question with a specific, audited number has a procurement problem it is marketing as an AI strategy.

The Bottom Line

Ninety-five percent idle is not an infrastructure strategy. It is an industry sprinting so hard toward the next GPU generation that it forgot to turn on the GPUs it already bought. More than $150 billion in enterprise GPU capacity will produce nothing this year while procurement teams fight over chip allocations they cannot secure fast enough, while TSMC books fabrication slots years in advance, and while residential electricity prices climb 7% annually under the compounding weight of data center demand that grows whether the compute inside those centers is used or not. Every one of those problems gets smaller, materially and immediately, when existing GPUs start doing work: not more hardware, not more power plants, not more water for cooling towers, just software configured correctly on machines that have been racked for months, powered continuously, and computing nothing.