AI Weather Models Beat Forecasters 364 Days a Year. On the Day That Kills People, They Choke.
A University of Geneva and Karlsruhe Institute of Technology study in Science Advances tested five AI weather models against every record-breaking heat, cold, and wind event from 2018 to 2020. On those events, every AI model did worse than the physics-based benchmark. An original calculation shows what that failure would have cost during the deadliest heat event in North American history.
Forty-nine point six degrees Celsius. That was the temperature recorded in Lytton, British Columbia, on June 29, 2021, a number so far outside the historical distribution for western Canada that statisticians later classified it as having a return period of roughly 1 in 10,000 years. Within 24 hours, 90 percent of the town burned. Across the Pacific Northwest heat dome that week, more than 1,400 people died from heat-related causes, overwhelmingly elderly residents in homes without air conditioning who did not receive adequate warning that temperatures were about to shatter every record in the region's measurement history by margins that forecasters considered physically implausible until they actually happened.
Now imagine running that week through the AI weather models that Google, Huawei, and Fudan University have spent three years marketing as replacements for traditional supercomputer-driven forecasting. A study published April 29, 2026 in Science Advances by Zhongwei Zhang of the Karlsruhe Institute of Technology and Sebastian Engelke of the University of Geneva provides the answer. They would have choked.
Five Models, One Test
Zhang and Engelke tested five AI weather models: Google DeepMind's GraphCast, its operational variant GraphCast-oper, Huawei's Pangu-Weather, its operational variant Pangu-oper, and Fudan University's Fuxi. All five were benchmarked against HRES, the High Resolution Forecast system operated by the European Centre for Medium-Range Weather Forecasts, which solves millions of partial differential equations on supercomputers several times daily to simulate the atmosphere. HRES is expensive. It requires dedicated infrastructure that costs hundreds of millions of euros. AI models can produce comparable forecasts on a single GPU in seconds.
Here was the test set: every record-breaking heat, cold, and wind event from 2018 to 2020, evaluated at forecast lead times ranging from one to ten days. Simple question. When weather breaks records, who gets closer to the actual number?
For routine forecasts, AI won, an expected result that was widely covered when DeepMind published GraphCast in Science in 2023. For record-breaking events, physics won, and the margin was not close.
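To make the shape of that comparison concrete, here is a minimal sketch of scoring two forecast systems separately on routine days and record-breaking days. It is not the study's code. The function names, the record threshold, and every temperature in it are hypothetical numbers I made up for illustration, and it uses plain root-mean-square error as the score.

```python
from math import sqrt

def rmse(errors):
    """Root-mean-square error of a list of forecast errors."""
    if not errors:
        return float("nan")
    return sqrt(sum(e * e for e in errors) / len(errors))

def stratified_rmse(observed, forecast, history_max):
    """Return (routine_rmse, record_rmse) for one forecast system.

    An observation counts as record-breaking when it exceeds history_max,
    the highest value seen before the evaluation period.
    """
    routine, record = [], []
    for obs, fc in zip(observed, forecast):
        bucket = record if obs > history_max else routine
        bucket.append(fc - obs)
    return rmse(routine), rmse(record)

# Hypothetical values in degrees Celsius, invented for illustration only.
history_max = 40.0
observed = [31.0, 28.5, 35.2, 41.1, 43.6]
ai_forecast = [31.1, 28.4, 35.0, 39.8, 40.5]
physics_forecast = [31.6, 27.9, 35.9, 40.6, 42.3]

ai_routine, ai_record = stratified_rmse(observed, ai_forecast, history_max)
phys_routine, phys_record = stratified_rmse(observed, physics_forecast, history_max)

print(f"routine days: AI {ai_routine:.2f}  physics {phys_routine:.2f}")
print(f"record days:  AI {ai_record:.2f}  physics {phys_record:.2f}")
```

With these invented inputs the AI column wins on routine days and loses on record days. A single pooled score would hide that split, which is why the two buckets are scored apart.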
How It Fails
Regression to the mean, literally, and that single phrase explains the entire failure mode.
All five AI models were trained on ERA5 reanalysis data spanning 1979 to 2017, learning the statistical distribution of historical weather so thoroughly that when asked to predict a temperature exceeding the maximum in their training data, they cannot extrapolate beyond what the past contained, because the past is all they know. Instead they pull predictions back toward the historical average. Hot record? Underpredicted. Cold record? Overpredicted. Bigger record, bigger error, growing monotonically with the distance between reality and the training ceiling, a relationship Zhang and Engelke document across all five models and nearly all lead times. Not a bug. A mathematical property of neural networks trained to minimize mean squared error on finite historical datasets, one that no amount of architectural cleverness can eliminate without fundamentally changing what these models optimize for.
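The pull toward the average can be reproduced in a toy setting. The sketch below is my own illustration, not the study's method: a straight-line least-squares fit stands in for a neural network trained on mean squared error, the "atmosphere" is a made-up normal distribution (mean 25, standard deviation 5), and the model sees only a noisy signal of the true temperature. All of those values are assumptions.

```python
import random

def fit_least_squares(xs, ys):
    """Return (slope, intercept) of the line minimizing mean squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_error_above(threshold, signals, truths, slope, intercept):
    """Average (prediction - truth) over events whose truth exceeds threshold."""
    errors = [slope * s + intercept - t
              for s, t in zip(signals, truths) if t > threshold]
    return sum(errors) / len(errors)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed toy world: true temperature ~ N(25, 5), observed through noise ~ N(0, 3).
truths = [rng.gauss(25.0, 5.0) for _ in range(20000)]
signals = [t + rng.gauss(0.0, 3.0) for t in truths]

slope, intercept = fit_least_squares(signals, truths)

# Bias on hot events (above 30) and on much hotter events (above 37.5).
bias_hot = mean_error_above(30.0, signals, truths, slope, intercept)
bias_hotter = mean_error_above(37.5, signals, truths, slope, intercept)

print(f"fitted slope: {slope:.2f}")
print(f"mean error, events above 30.0: {bias_hot:.2f}")
print(f"mean error, events above 37.5: {bias_hotter:.2f}")
```

The fitted slope comes out below 1, so predictions are compressed toward the average, and the mean error is negative and grows as the threshold rises. This toy shows only the shrinkage that comes from minimizing squared error. It does not capture the separate problem of a model never having seen values beyond its training range.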
From one day out to ten days out, the AI models produced larger errors than HRES during record-breaking events, a pattern that Zhang and Engelke call "stable across nearly all lead times." The failure is therefore not a quirk of short-range or long-range prediction but a structural deficiency baked into the modeling approach itself. Physics-based models solve equations describing atmospheric dynamics rather than pattern-matching against historical data, so they can represent conditions the atmosphere has never produced before. AI cannot. Full stop.
Consider what this means concretely. During the 2021 heat dome, Lytton hit 49.6°C, shattering the previous Canadian record of 45.0°C by 4.6 degrees. AI models, trained on a world where 45°C was the ceiling, would have predicted something closer to that historical bound. The exact size of the underprediction depends on lead time and model specifics, but the Geneva-Karlsruhe findings are unambiguous: every AI model tested produces larger forecast errors than HRES for events of this magnitude, and the error grows with the degree to which the event exceeds historical records.
Counting Bodies
Nobody else has done this math. Here it is.
Early warning systems save lives, and according to the World Meteorological Organization, they have cut weather-related disaster deaths by roughly half since 1970, a period during which weather disasters killed approximately 2 million people worldwide and caused $4.3 trillion in economic losses across 12,000 catalogued events. Mortality reductions are directly attributable to improvements in forecast accuracy and lead time, because accurate forecasts trigger evacuations, cooling center activations, and hospital surge preparations that convert information into survival.
During the 2021 Pacific Northwest heat dome, Environment and Climate Change Canada issued its most extreme warning to date, but the forecast itself underestimated peak temperatures by several degrees even with physics-based models. British Columbia's Coroners Service documented 619 heat-related deaths in BC alone, with total excess mortality across the Pacific Northwest exceeding 1,400.
Forecast accuracy and heat mortality are directly linked. A 2022 Nature Communications study on heat action plans found that each additional day of accurate lead time before an extreme heat event reduces heat-related mortality by 4 to 19 percent, depending on population vulnerability and public health response infrastructure. If an AI forecast underpredicts peak temperatures enough that the warning drops from "extreme, life-threatening" to merely "severe," the downstream effects are concrete: fewer cooling centers open, fewer wellness checks on elderly residents, lower compliance with stay-indoors advisories.
I built a simple model. Start with 1,400 excess deaths from the 2021 heat dome. Assume the enhanced warning prevented roughly 30 percent of the deaths that would have occurred with no warning at all, a conservative figure within WMO's estimated range. That implies a no-warning death toll of approximately 2,000 (1,400 divided by 0.7). Now assume an AI forecast, which consistently underpredicts record-breaking temperatures, would have produced a warning one severity tier lower than what was actually issued. Treat that lost tier as equivalent to one lost day of accurate lead time, which is my assumption rather than the study's, and apply the Nature Communications range of 4 to 19 percent to the no-warning baseline of 2,000. The accuracy gap then translates to somewhere between 80 and 380 additional deaths. In one event.
| Scenario | Warning Level | Estimated Deaths |
|---|---|---|
| No warning system | None | ~2,000 |
| Physics-based forecast (actual) | Extreme, life-threatening | ~1,400 |
| AI forecast (low estimate) | Degraded one tier | ~1,480 |
| AI forecast (high estimate) | Degraded one tier | ~1,780 |
Rough numbers with significant uncertainty in the inputs, which I spell out in the limitations section below. But the direction is unambiguous: a forecast that systematically underpredicts lethal events will, on the margin, get more people killed. Is it 80 or 380 for a single event? Debatable. Is the number zero? No.
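For anyone who wants to check or vary the arithmetic, here is the calculation as a short script. The function and parameter names are mine. The inputs are the assumptions stated above, and the 4 and 19 percent penalties are applied to the no-warning baseline, which is what reproduces the 80 and 380 figures.

```python
def heat_dome_scenarios(observed_deaths=1400, warning_benefit=0.30,
                        tier_penalty_low=0.04, tier_penalty_high=0.19):
    """Reproduce the scenario table from the article's stated assumptions.

    warning_benefit: assumed share of no-warning deaths the real warning prevented.
    tier_penalty_*:  assumed extra mortality from a one-tier weaker warning,
                     expressed as a fraction of the no-warning baseline.
    """
    # If the warning prevented 30 percent of deaths, observed = 70 percent of baseline.
    no_warning = observed_deaths / (1 - warning_benefit)
    extra_low = tier_penalty_low * no_warning
    extra_high = tier_penalty_high * no_warning
    return {
        "no_warning": round(no_warning),
        "physics_actual": observed_deaths,
        "extra_low": round(extra_low),
        "extra_high": round(extra_high),
        "ai_low": round(observed_deaths + extra_low),
        "ai_high": round(observed_deaths + extra_high),
    }

scenarios = heat_dome_scenarios()
for name, value in scenarios.items():
    print(f"{name:>15}: {value:,}")
```

Changing `warning_benefit` moves every row at once, which is the main reason to treat the output as a range rather than a prediction. For example, setting it to 0.5, closer to WMO's "roughly half" figure, raises the implied no-warning baseline to 2,800 and widens the AI range accordingly.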
Climate Makes It Worse
Here is the study's most troubling finding. Climate change is pushing more weather events beyond historical records. The training-data ceilings that constrain AI forecasts stay where they are, but their effect does not: they become more binding over time as the atmosphere produces temperatures, precipitation rates, and wind speeds that exceed everything in the 1979-to-2017 window. Zhang and Engelke note this directly: "Under a warming climate, record-breaking events will occur more frequently, precisely in the regime where AI models struggle most."
Look at the trajectory. U.S. heat deaths hit 2,300 in 2023. One year. That is a 117 percent increase since 1999, according to the CDC, and the trend is accelerating, not stabilizing. Record-breaking heat events are not rare tail risks anymore. They happen every year now. Europe lost more than 70,000 people to the 2003 heatwave, a catastrophe that reshaped continental public health policy. It did not prevent Russia from losing 55,000 in 2010 or Europe from losing another 60,000 or more in 2022, because each event pushed further beyond what prior experience had prepared authorities to expect. These are also the events where forecast models, had they been AI-only, would have underpredicted the peak by the widest margins, precisely because they broke the most records.
A perverse feedback loop emerges from this dynamic. AI weather models improve every year at predicting weather that is becoming less lethal, thanks to improved infrastructure and adaptation, while remaining fundamentally limited at predicting weather that is becoming more lethal, because the atmosphere keeps producing conditions outside the training distribution and the models cannot learn from events they have not yet observed during training. Routine accuracy and extreme-event accuracy do not converge. They diverge. Climate pushes further from the statistical past while training data stays anchored to it, and the gap between what AI can predict and what the atmosphere can produce grows wider with every record that falls.
Best Case for AI Anyway
Strongest counterargument: nobody serious proposes that AI models replace physics-based forecasting entirely. A realistic future is an ensemble approach where AI handles daily forecasts, faster and cheaper and good enough for 99 percent of days, while HRES and similar physics-based systems remain the gold standard for extreme events. ECMWF already uses this approach, running its own AI model AIFS alongside HRES. Computational savings from offloading routine forecasts to AI could free up supercomputer capacity for more ensemble members at higher resolution during extreme events, actually improving the physics-based system that matters most when lives are at stake.
There is real force in that argument, because running GraphCast costs roughly 1/1000th the compute of a full HRES forecast cycle. If meteorological agencies redirect the savings into more granular extreme-event modeling, the net effect could be positive, potentially even better than the status quo. But the risk is institutional. Funding bodies see AI's daily accuracy numbers, note the cost savings, and cut the supercomputer budgets that run the physics models responsible for the forecasts that matter most during catastrophic events. According to WMO reporting on infrastructure investment trends, several national meteorological agencies in developing countries have already started down this path, citing AI weather models as a reason to defer planned supercomputer upgrades. Cheaper does not mean better. Not when the system that saves lives during the worst events loses funding because a cheaper system handles average days well.
Limitations
Five things this analysis cannot fully resolve. First, the 30 percent mortality reduction from warnings is an assumed figure within WMO's broad range across all weather disasters; heat events, where evacuation is less relevant than cooling and hydration, may behave differently from hurricanes or floods. Second, "degraded one tier" is a simplification, because forecasters in practice examine multiple model outputs and do not blindly trust any single system, meaning the actual warning degradation might be smaller than I modeled. Third, Zhang and Engelke tested events from 2018 to 2020, and AI models have improved since, but the training-data ceiling is a mathematical constraint rather than an engineering problem with a clear fix. Fourth, ERA5 reanalysis served as ground truth, and it has known biases in representing extreme events, so true errors could be either larger or smaller than reported. Fifth, my mortality model treats the warning-to-action chain as linear, when real-world emergency response involves institutional inertia, political factors, and feedback loops that a percentage reduction cannot capture. The path from a forecast number on a screen to a cooling center opening its doors passes through human judgment, bureaucratic approval chains, and media coverage decisions that no model of mine or anyone else's can fully represent.
What You Can Do
If you work at a meteorological agency: Do not decommission physics-based forecasting infrastructure based on AI accuracy benchmarks that measure routine performance. Zhang and Engelke provide specific evidence that extreme-event accuracy, the metric that determines whether people live or die, diverges sharply from daily accuracy. Benchmark your ensemble systems separately for record-breaking events.
If you allocate research funding: The most impactful open problem in AI weather forecasting is not shaving 0.3 percent off daily prediction error. It is closing the extreme-event gap. Hybrid architectures that use physics constraints as guardrails on AI predictions are a promising direction. So are ensemble methods that trigger physics-based models automatically when AI confidence drops near distribution boundaries. Fund those.
If you live in a region prone to extreme heat: Do not rely on a single forecast source. Check ECMWF's ensemble forecast (ecmwf.int/en/forecasts) directly during heat advisories. If multiple models agree on extreme temperatures, take it seriously even if the numbers seem implausible. ECMWF's ensemble flagged the 2021 heat dome a week in advance. Many local agencies did not escalate their warnings because predicted temperatures looked like model errors. They were not errors.
If you manage emergency services: Build trigger thresholds for extreme-event preparations based on physics-based model outputs, not AI outputs. Cooling center activations, hospital surge staffing, and welfare check programs should key off the system that performs better when it matters most.
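The trigger idea in the funding and emergency-services advice above can be sketched in a few lines. Everything here is hypothetical: the `Forecast` class, the 2-degree margin, the training maximum, and the tier thresholds are placeholders I chose for illustration, not values from the study or from any agency's operating procedure.

```python
from dataclasses import dataclass

@dataclass
class Forecast:
    source: str         # "ai" or "physics"
    peak_temp_c: float  # forecast peak temperature in degrees Celsius

def choose_forecast(ai, physics, training_max_c, margin_c=2.0):
    """Use the AI forecast unless either model nears the training ceiling.

    training_max_c is the hottest value in the AI model's training data.
    Once any forecast comes within margin_c of it, defer to physics.
    """
    hottest = max(ai.peak_temp_c, physics.peak_temp_c)
    near_ceiling = hottest >= training_max_c - margin_c
    return physics if near_ceiling else ai

def warning_tier(peak_temp_c, thresholds):
    """Return the highest tier whose threshold the peak temperature meets."""
    tier = "none"
    for name, limit in sorted(thresholds.items(), key=lambda kv: kv[1]):
        if peak_temp_c >= limit:
            tier = name
    return tier

# Hypothetical tier thresholds and training ceiling, for illustration only.
THRESHOLDS = {"advisory": 32.0, "severe": 38.0, "extreme": 44.0}
TRAINING_MAX_C = 45.0

routine = choose_forecast(Forecast("ai", 30.0), Forecast("physics", 30.5), TRAINING_MAX_C)
record = choose_forecast(Forecast("ai", 43.0), Forecast("physics", 47.5), TRAINING_MAX_C)

print(routine.source, warning_tier(routine.peak_temp_c, THRESHOLDS))
print(record.source, warning_tier(record.peak_temp_c, THRESHOLDS))
```

In the invented record-day case, the AI forecast of 43.0 would map to "severe" while the physics forecast of 47.5 maps to "extreme", so keying the trigger to the physics output is what keeps the warning at the top tier. A real system would need calibrated thresholds and a probabilistic trigger, not a single cutoff.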
Bottom Line
AI weather models are a genuine breakthrough for daily forecasting, cheaper, faster, and more accurate than physics-based systems 99 percent of the time. But that final one percent contains all the events that fill morgues and flatten towns, and on those days, AI pulls its predictions toward a historical average that the climate is leaving behind. Zhang and Engelke quantify what atmospheric scientists suspected and what AI labs have not volunteered: there is a ceiling, it is structural, and it gets worse as the planet warms. The right response is not to abandon AI forecasting, because the daily gains are real and the cost savings are genuine. It is to ensure that physics-based systems, the ones capable of imagining weather the world has never seen, keep running on the days we need them most, because the atmosphere does not care what the training data said was possible.