A Lab Trained a Billion-Parameter AI Model Using 30 Watts. An H100 Uses 700.

Three photonic computing systems crossed from laboratory prototypes to production hardware in the span of six weeks: a PNAS-published optical processing unit achieving 50 TeraOPS per watt, Q.ANT's second-generation photonic chips benchmarked at Germany's national supercomputing center, and Oxford spinout Lumai's Iris Nova server running Llama 70B in real time. An original efficiency analysis shows the PNAS system delivers at least 8.8 times more compute per watt than NVIDIA's H100, and a data center cost model puts the implied electricity savings at more than $73 million per year for a single 100-megawatt facility.

Fifty TeraOPS per watt.

That is the efficiency figure buried in a paper published this month in the Proceedings of the National Academy of Sciences, and it represents something the AI hardware industry has been chasing since NVIDIA's first data center GPU drew 300 watts and promised the future would be worth the electricity bill. The paper describes a Hybrid Electronic-Photonic Optical Processing Unit that trained a billion-parameter neural network while drawing less than 30 watts of power. For comparison, an NVIDIA H100 SXM GPU achieves 2,000 to 4,000 TeraOPS of throughput but consumes 700 watts to do it, yielding roughly 2.9 to 5.7 TeraOPS per watt. The photonic system achieves 1,500 TeraOPS at 30 watts, or 50 TeraOPS per watt. Divide those numbers and the gap is roughly 8.8 to 17 times in favor of light over electrons.
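The division is easy to reproduce. The following is a minimal sketch that uses only the throughput and power figures quoted above; the function and variable names are illustrative.

```python
def tops_per_watt(tera_ops: float, watts: float) -> float:
    """Efficiency in TeraOPS per watt."""
    return tera_ops / watts

# Figures quoted in the article
h100_low = tops_per_watt(2_000, 700)   # lower H100 throughput figure
h100_high = tops_per_watt(4_000, 700)  # upper H100 throughput figure
opu = tops_per_watt(1_500, 30)         # PNAS optical processing unit

# The advantage is smallest against the H100's best case
# and largest against its worst case.
advantage_min = opu / h100_high
advantage_max = opu / h100_low

print(f"H100: {h100_low:.1f} to {h100_high:.1f} TOPS/W")
print(f"OPU:  {opu:.1f} TOPS/W")
print(f"Advantage: {advantage_min:.2f}x to {advantage_max:.2f}x")
```

The calculation works from unrounded values, so its results can differ in the last digit from ratios taken between the rounded TOPS/W figures.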

That single metric would be interesting on its own. What makes it significant is that it landed in the same six-week window as two other milestones that collectively suggest photonic computing has stopped being a conference-talk curiosity and started becoming infrastructure.

Three Milestones in Six Weeks

In Germany, Stuttgart-based Q.ANT deployed its second-generation photonic processors at the Leibniz Supercomputing Centre, one of Europe's premier high-performance computing facilities. Its Native Processing Units use Thin-Film Lithium Niobate waveguides to perform matrix operations in the optical domain, and the Gen 2 hardware produced benchmark results that would be remarkable for any co-processor: 50 times higher matrix multiplication throughput than the first generation, 25 times faster inference on the ResNet-18 computer vision benchmark, and 6 times lower energy consumption per workload. Those chips plug into standard PCIe slots. They coexist with conventional CPUs and GPUs. They work now, under production conditions, at a national supercomputing center that serves universities across Bavaria.

In Oxford, a University of Oxford spinout called Lumai launched the Iris Nova, which it describes as the world's first optical computing server designed for real-time large language model inference. Lumai demonstrated the system running Meta's Llama 8B and Llama 70B models in real time, using a hybrid architecture that pairs an optical tensor engine for heavy mathematical operations with digital processing for system control. Lumai claims up to 90 percent lower power consumption compared to silicon GPU servers of equivalent throughput, and the UK government's Advanced Research and Invention Agency has backed the project through its AI compute program. The Iris Nova server is available for evaluation by hyperscalers and enterprise customers today.

At March's Optical Fiber Communication Conference (OFC), AMD, Broadcom, Meta, Microsoft, NVIDIA, and OpenAI jointly announced the Optical Compute Interconnect Multi-Source Agreement, a consortium effort to standardize optical interconnect specifications at up to 800 gigabits per second for AI rack infrastructure. When every major GPU and AI model company agrees that photonics is the transport layer of the future, the question stops being whether light will replace copper in data centers and starts being when light will replace transistors in the chips themselves.

Why Light Works Where Electrons Fail

Start with the physics, which is simpler than the buzzwords suggest. Modern neural networks spend the vast majority of their computational cycles on a single operation: multiplying enormous matrices together. A forward pass through a Transformer model is, at its mathematical core, a cascade of matrix multiplications interspersed with nonlinear activation functions. GPUs perform these multiplications by switching billions of transistors on and off, and every switch generates heat. An H100 running at full load dissipates 700 watts almost entirely as thermal energy, energy that data centers must then remove using industrial cooling systems that consume additional power.

In a photonic processor, the multiplication happens differently. In the PNAS paper's system, neural network data is encoded onto a Spatial Light Modulator, essentially a microscopic digital projector screen with millions of individually controllable pixels. A low-power laser illuminates the modulator, and the light passes through a random scattering medium, a piece of engineered glass that performs millions of parallel multiplications as photons interact with the material's physical structure. A high-speed CMOS camera sensor reads the intensity pattern on the other side. From input encoding to output measurement, the computation consumes energy in only three places: the laser, the modulator's refresh cycle, and the camera's readout. Inside the optical path itself, no transistors switch and almost no heat accumulates. All of this math happens at the speed of light through glass, and 30 watts covers the overhead.
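In numerical terms, a scattering medium followed by an intensity camera is commonly modeled as a fixed complex random matrix whose output is read as squared magnitude. The sketch below assumes that model; the exact optical transfer function of the PNAS system is not given here, and the matrix `R`, the dimensions, and the function name are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; real modulators and cameras have far more pixels.
n_in, n_out = 64, 256

# Assumption: the scattering medium acts as a fixed complex Gaussian
# transmission matrix. It is drawn once and never changes.
R = (rng.standard_normal((n_out, n_in))
     + 1j * rng.standard_normal((n_out, n_in))) / np.sqrt(2 * n_in)

def optical_projection(x: np.ndarray) -> np.ndarray:
    """Simulate modulator -> scattering medium -> camera for one input vector."""
    field = R @ x              # light field after scattering: one matrix-vector product
    return np.abs(field) ** 2  # the camera records intensity, not the complex field

# A binary pattern stands in for the image displayed on the modulator.
x = rng.integers(0, 2, size=n_in).astype(float)
y = optical_projection(x)
print(y.shape, bool((y >= 0).all()))
```

In the simulation the matrix product costs `n_in * n_out` multiply-accumulates. In the physical device the same product is performed by light propagation, which is why the energy budget reduces to the laser, the modulator, and the camera.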

What makes this work for training, not just inference, is a clever algorithmic substitution. Standard deep learning uses backpropagation, which requires the network to chain error gradients backward through every layer using the exact transpose of each layer's weight matrix. Implementing perfect matrix transposition in analog optical hardware is impractical because physical systems introduce noise that compounds across layers. Instead, the PNAS researchers used Direct Feedback Alignment, an algorithm that projects the global output error directly to each hidden layer through a fixed random matrix. Because these random matrices never change during training, they can be physically embodied in the scattering medium permanently. Once fixed, the hardware does not need to constantly read, transpose, and rewrite massive weight matrices, which is the operation that makes GPU-based training so power-hungry in the first place.
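The substitution can be made concrete with a toy example. The sketch below is a minimal NumPy illustration of Direct Feedback Alignment on a small regression problem, not the paper's implementation; the network sizes, learning rate, synthetic data, and the names `B1` and `B2` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (illustrative, not the paper's workload)
X = rng.standard_normal((256, 8))
y = X @ rng.standard_normal((8, 1)) / np.sqrt(8)

# Trainable forward weights: 8 -> 32 -> 32 -> 1
W1 = rng.standard_normal((8, 32)) / np.sqrt(8)
W2 = rng.standard_normal((32, 32)) / np.sqrt(32)
W3 = np.zeros((32, 1))

# Fixed random feedback matrices, one per hidden layer. They are never
# updated, which is the property that lets hardware hold them permanently.
B1 = rng.standard_normal((1, 32))
B2 = rng.standard_normal((1, 32))

lr, losses = 0.02, []
n = X.shape[0]
for step in range(300):
    # Forward pass
    h1 = np.tanh(X @ W1)
    h2 = np.tanh(h1 @ W2)
    out = h2 @ W3
    e = out - y  # global output error
    losses.append(float(np.mean(e ** 2)))

    # DFA: send the output error straight to each hidden layer through B,
    # instead of chaining it backward through W3.T and W2.T as backprop would.
    d2 = (e @ B2) * (1 - h2 ** 2)
    d1 = (e @ B1) * (1 - h1 ** 2)

    W3 -= lr * h2.T @ e / n   # output layer uses the error directly
    W2 -= lr * h1.T @ d2 / n
    W1 -= lr * X.T @ d1 / n

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Note that no line in the update step reads or transposes `W2` or `W3` to compute the hidden-layer signals. The only matrices involved in feedback are `B1` and `B2`, and they stay constant for the whole run.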

The Efficiency Gap, Quantified

Here is the comparison table that no photonic computing company has published, presumably because it requires acknowledging limitations alongside the impressive headline numbers:

| System | Peak throughput (TeraOPS) | Power draw | Efficiency (TOPS/W) | Verified workload |
| --- | --- | --- | --- | --- |
| NVIDIA H100 SXM | 2,000–4,000 | 700 W | 2.9–5.7 | GPT-scale Transformers |
| PNAS OPU | 1,500 | 30 W | 50.0 | Billion-parameter MLPs/CNNs |
| Q.ANT Gen 2 NPU | Not disclosed | 6× lower energy per workload than Gen 1 | ~17–34 (est.) | ResNet-18 |
| Lumai Iris Nova | Not disclosed | 90% less than GPU equivalent | ~29–57 (est.) | Llama 8B and 70B inference |

I have estimated Q.ANT's TOPS-per-watt range from the company's stated 6x energy reduction and 50x throughput improvement over Gen 1, assuming Gen 1 operated at roughly GPU-equivalent efficiency baselines. These throughput and energy claims are company-reported, not independently benchmarked. Lumai's range derives from its 90 percent power reduction claim applied to an assumed GPU server drawing 1,500 to 3,000 watts for equivalent Llama 70B throughput. Both estimates are generous but grounded in the companies' own disclosures; readers should treat them as directional, not audited. Note that the PNAS system's 1,500 TeraOPS was measured on MLPs and CNNs, not Transformer architectures.
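The two estimated ranges are consistent with scaling the H100 baseline by each vendor's claimed improvement. The sketch below assumes that is how they were derived; the scaling factors come from the company claims quoted above, and the helper name is illustrative.

```python
# H100 efficiency range in TOPS/W, from 2,000 to 4,000 TeraOPS at 700 W
H100_RANGE = (2_000 / 700, 4_000 / 700)

def scale_range(baseline, factor):
    """Scale a (low, high) efficiency range by a constant factor."""
    low, high = baseline
    return (low * factor, high * factor)

# Assumption: a GPU-equivalent baseline improved by Q.ANT's stated
# 6x energy reduction.
qant_est = scale_range(H100_RANGE, 6)

# Assumption: "90 percent lower power" at equal throughput means
# 1 / (1 - 0.90) = 10x the compute per watt.
lumai_est = scale_range(H100_RANGE, 1 / (1 - 0.90))

for name, (low, high) in [("Q.ANT Gen 2", qant_est),
                          ("Lumai Iris Nova", lumai_est)]:
    print(f"{name}: ~{low:.0f} to {high:.0f} TOPS/W (estimate)")
```

Because both ranges are a constant multiple of the same GPU baseline, any error in that baseline assumption carries through to the estimates unchanged.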

Across all three systems, the pattern is consistent. On these reported and estimated figures, photonic processing delivers somewhere between 6 and 17 times more compute per watt than NVIDIA's H100.

What This Means at Data Center Scale

The numbers get interesting when you extrapolate from single-chip efficiency to facility-level economics, a thought experiment that nobody in the photonic computing industry appears to have published and that almost certainly overstates real-world savings due to integration losses.

Consider a 100-megawatt AI data center, a size that is increasingly standard for hyperscaler deployments. At 700 watts per GPU, that facility houses approximately 143,000 NVIDIA H100 equivalents, producing roughly 286 million to 571 million TeraOPS of aggregate throughput at 2,000 to 4,000 TeraOPS per GPU. At the PNAS paper's 50 TOPS/W efficiency, delivering the same aggregate throughput would require roughly 6 to 11 megawatts of power. Electricity savings: 89 to 94 percent.

At a blended rate of $0.10 per kilowatt-hour, that 100MW facility's annual electricity bill is approximately $87.6 million. A photonic-equivalent facility drawing 6 to 11 megawatts would cost roughly $5 to $10 million per year. That is a $78 to $83 million annual savings from a single data center, before accounting for the enormous reduction in cooling infrastructure that disappears when your compute hardware barely generates heat.
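The cost model can be written out as a short sketch. It takes its inputs from the article (100 MW, 700 W per GPU, 2,000 to 4,000 TeraOPS per GPU, 50 TOPS/W, $0.10 per kWh) and assumes single-system efficiency scales linearly to a whole facility, with cooling and integration overhead ignored. The function names are illustrative.

```python
HOURS_PER_YEAR = 8_760

def gpu_count(facility_mw: float, watts_per_gpu: float) -> float:
    """Number of GPUs a facility can power, ignoring cooling and overhead."""
    return facility_mw * 1e6 / watts_per_gpu

def equivalent_power_mw(facility_mw: float, gpu_tops_per_w: float,
                        photonic_tops_per_w: float) -> float:
    """Power a photonic facility would need to match the GPU facility's throughput."""
    return facility_mw * gpu_tops_per_w / photonic_tops_per_w

def annual_cost_usd(power_mw: float, usd_per_kwh: float) -> float:
    """Yearly electricity cost for a constant power draw."""
    return power_mw * 1_000 * HOURS_PER_YEAR * usd_per_kwh

facility_mw, rate = 100, 0.10
baseline_cost = annual_cost_usd(facility_mw, rate)
print(f"GPUs: {gpu_count(facility_mw, 700):,.0f}")
print(f"Baseline bill: ${baseline_cost / 1e6:.1f}M per year")

for tera_ops in (2_000, 4_000):  # H100 throughput figures from the article
    photonic_mw = equivalent_power_mw(facility_mw, tera_ops / 700, 50)
    saving = baseline_cost - annual_cost_usd(photonic_mw, rate)
    print(f"{tera_ops} TeraOPS/GPU: {photonic_mw:.1f} MW, "
          f"saves ${saving / 1e6:.1f}M per year")
```

Every output is linear in the efficiency ratio, so halving the assumed photonic TOPS/W doubles the photonic facility's power draw and its bill. That sensitivity is the main reason to treat the result as an upper bound.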

Goldman Sachs estimated that global data center electricity costs hit $11.5 billion in 2024 and will reach $34 billion by 2027, driven almost entirely by AI workloads. If photonic computing cut that bill by even half, well short of the single-facility savings modeled above, the global savings by 2027 would approach $17 billion annually. That figure rivals the total annual revenue of the companies building these data centers' cooling systems.

The Strongest Case Against

What follows is the argument that photonic computing advocates are least equipped to answer, stated at full strength: NVIDIA's CUDA ecosystem has fifteen years of continuous optimization by thousands of engineers. PyTorch, TensorFlow, JAX, and the entire deep learning software stack are built around GPU-centric computation using backpropagation. Direct Feedback Alignment works for multilayer perceptrons and convolutional networks, but it has not been demonstrated on Transformer architectures, the architecture powering GPT-4, Claude, Gemini, Llama, and every frontier language model. Until DFA or a successor algorithm can train Transformers at competitive accuracy and scale, photonic processors are structurally excluded from the workloads that are driving most of the world's AI infrastructure investment.

This is not a minor caveat. Frontier model training represents the single largest and fastest-growing category of GPU demand. On the training side, the PNAS paper's billion-parameter achievement was demonstrated on MLPs and CNNs, architectures that, while commercially important, are not the ones consuming 100-megawatt data centers. Q.ANT's production benchmarks used ResNet-18, a computer vision model published in 2015. Lumai demonstrated inference on Llama 70B, which is promising for the inference market but does not address training.

A second structural challenge is the input/output penalty. In any hybrid photonic-electronic system, data must be converted from digital memory to an optical signal on the modulator and then converted back from analog light intensity to digital values via analog-to-digital converters. This signal conversion consumes time and energy that partially offsets the gains from optical computation itself. The 50 TOPS/W figure covers the optical computation, but real-world throughput depends on how fast you can shuttle data between the electronic and photonic domains, a bottleneck the paper acknowledges but does not resolve. For inference workloads like Lumai's Llama 70B demonstration, latency per token matters as much as throughput per watt, and no photonic vendor has published head-to-head latency benchmarks against GPUs running identical models.

No photonic processor has been manufactured at GPU-equivalent volumes, which means per-unit costs remain unknown. It is entirely possible that a system delivering 50 TOPS/W costs more per unit than a GPU delivering 5.7 TOPS/W. Until photonic chip fabrication scales, the economic case at the unit level remains theoretical.

Limitations of This Analysis

The PNAS paper's billion-parameter training demonstration used Direct Feedback Alignment on MLPs and CNNs, not Transformers. DFA's accuracy gap with standard backpropagation narrows as network width increases, but whether this holds at hundred-billion-parameter Transformer scale is unknown because nobody has tested it. Q.ANT's benchmarks were performed on ResNet-18 at LRZ, and the company has not published comparable results on modern Transformer workloads or disclosed absolute TOPS numbers for its Gen 2 hardware; its 50x throughput and 6x energy improvement claims are company-reported and have not been independently verified. Lumai demonstrated inference but has not published detailed latency, throughput, or tokens-per-second metrics against a GPU baseline. My data center cost model assumes that the PNAS paper's single-system efficiency translates linearly to facility scale, which almost certainly overstates the real-world savings because system integration, interconnect overhead, and data movement between photonic and electronic components will introduce losses. All three systems are at pilot or evaluation stage, not mass deployment.

What You Can Do

If you run data center infrastructure or make purchasing decisions for AI compute, the actionable step is to request evaluation units from Lumai (Iris Nova is available now) and monitor Q.ANT's US expansion for domestic evaluation opportunities. The technology is real enough to benchmark against your specific workloads. It is not ready to replace your GPU clusters.

If you are an investor, track three leading indicators: whether any photonic vendor demonstrates competitive Transformer training accuracy (which would eliminate the largest remaining objection), whether TSMC or GlobalFoundries announces a photonic chip fabrication partnership (which would signal volume manufacturing readiness), and whether the OCI consortium's optical interconnect standard accelerates from a specification into shipping silicon. Each milestone removes a structural barrier; all three together would constitute a regime change in AI hardware economics.

If you are a policymaker concerned about AI's energy footprint, the takeaway is that the underlying physics of efficient optical matrix multiplication has been demonstrated. Photonic processors deliver an order-of-magnitude improvement in computational energy efficiency in controlled settings. What remains is engineering, software ecosystem development, and manufacturing scale. These are solvable problems on known timelines, not fundamental research challenges. Energy efficiency standards for AI compute that account for photonic alternatives could accelerate the transition, though premature mandates risk locking in specifications before the technology stabilizes.

The Bottom Line

AI's energy crisis is real. Data center electricity demand is tripling by 2027, nuclear power plants are being brought online specifically to run neural networks, and Meta just raised its capex guidance to $125 to $145 billion for the year, with a significant fraction dedicated to power and cooling. The industry's default answer to this problem has been to build more power plants. Three labs in six weeks just demonstrated that the better answer might be to change what the electricity does when it arrives. A laser, a piece of glass, and a camera. Thirty watts. Fifty TeraOPS per watt. Physics works. Engineering is catching up. And the software ecosystem, unlike the laws of thermodynamics, can be rewritten.
