On-Device Speech Recognition Burns 30–100x More Energy Than Sending Audio to the Cloud. Privacy Is That Expensive.

A first-of-its-kind energy-per-utterance comparison across ten automatic speech recognition systems reveals a counterintuitive truth: running inference on your phone’s neural engine costs thousands of millijoules while a WiFi round-trip to a cloud model costs tens. Apple and Google spend billions engineering the less efficient path, and the reason is not technical — it’s legal.

Three thousand millijoules. That is the minimum energy your iPhone’s Apple Neural Engine burns to transcribe a 30-second voice memo on-device, hammering through a conformer-style encoder-decoder model of the same general class as OpenAI’s Whisper at 1–2 watts of sustained inference power. Sending that same audio to Google’s Chirp 2 over WiFi? Twenty millijoules. Maybe a hundred, if you are far from the router and the radio has to shout.

That ratio — 30 to 100 times more energy for local processing — has never been explicitly calculated and published. It should be, because it demolishes one of the most persistent assumptions in consumer technology: that keeping computation on-device is the greener, more efficient choice.

It is not, at least not for speech recognition, and the gap is not close.

Ten Systems, One Question

We compared ten automatic speech recognition systems spanning on-device, cloud API, and open-source deployments, and asked a question nobody had bothered to quantify: how many millijoules does each path consume to transcribe a single 30-second utterance from microphone to text?

The systems: Apple’s SpeechAnalyzer (on-device, powering Siri and dictation), Google’s on-device conformer models (Pixel Tensor G4 TPU), Google Cloud Speech-to-Text Chirp 2 (2-billion-parameter cloud model), Deepgram Nova-3 (streaming cloud at $4.30 per thousand minutes), OpenAI’s Whisper Large V3 (1.55 billion parameters, 99+ languages, open source), NVIDIA’s Canary Qwen 2.5B (current accuracy champion at 5.63% word error rate on the Hugging Face Open ASR Leaderboard), IBM’s Granite Speech 3.3 (5.85% WER, Apache 2.0 licensed), Moonshine v2 by Useful Sensors (edge-optimized, under 60 million parameters), AssemblyAI Universal-2, and Speechmatics.

The accuracy spread is enormous: NVIDIA Canary hits 5.63% WER but demands an NVIDIA GPU and speaks only English. Whisper Large V3 manages 7.4% across 99 languages. On-device models from Apple and Google land somewhere around 8–12%, though neither company publishes official benchmarks, while Moonshine squeezes competitive accuracy from a model small enough to run on a Raspberry Pi. The leaderboard for the systems with comparable figures:

| System | WER | Parameters | Languages | Deployment |
|---|---|---|---|---|
| NVIDIA Canary Qwen 2.5B | 5.63% | 2.5B | EN only | GPU server |
| IBM Granite Speech 3.3 | 5.85% | ~9B | EN + translation | GPU server |
| OpenAI Whisper Large V3 | 7.4% | 1.55B | 99+ | Cloud API / self-host |
| Whisper Large V3 Turbo | 7.75% | 809M | 99+ | Cloud API / self-host |
| Apple SpeechAnalyzer | ~8–12%* | ~80–150M | ~20 | On-device (ANE) |
| Google On-Device | ~8–12%* | ~80–200M | 8+ offline | On-device (Tensor TPU) |
| Deepgram Nova-3 | ~8–11% | Undisclosed | 47+ | Cloud API |
| Moonshine Base | ~9–12% | ~60M | EN | Edge CPU |

*Estimated from independent testing; Apple and Google do not publish official WER figures for on-device models.

But accuracy is not where the surprise lives.

The Energy Math Nobody Published

Here is the calculation, broken into its physical components for a 30-second utterance.

Cloud path (WiFi): Encode 30 seconds of audio using the Opus codec at 32 kbps, which produces roughly 120 kilobytes. A WiFi radio transmitting at 100–500 milliwatts pushes that payload in approximately 0.1 seconds, consuming 10–50 millijoules. The phone then idles for 200–500 milliseconds while the cloud model processes (WiFi idle draw: 50–100 milliwatts, adding another 10–50 millijoules). The result returns as a sub-kilobyte text payload — negligible — bringing the total to 20–100 millijoules.
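The WiFi budget above is simple enough to check by hand. Here is a minimal Python sketch that uses only the figures quoted in the paragraph; the function and constant names are illustrative and do not come from any measurement tool:

```python
# Figures quoted in the text; names are illustrative.
OPUS_BITRATE_BPS = 32_000   # Opus at 32 kbps
UTTERANCE_S = 30            # utterance length in seconds


def payload_kilobytes(bitrate_bps, seconds):
    """Encoded audio size in kilobytes (1 kB = 1,000 bytes)."""
    return bitrate_bps * seconds / 8 / 1000


def energy_mj(power_mw, duration_ms):
    """Energy in millijoules from power in milliwatts and time in milliseconds."""
    return power_mw * duration_ms / 1000


def wifi_cloud_energy_mj():
    """(low, high) client-side energy for one WiFi round trip."""
    transmit = (energy_mj(100, 100), energy_mj(500, 100))  # 100-500 mW for ~0.1 s
    idle = (energy_mj(50, 200), energy_mj(100, 500))       # 50-100 mW for 200-500 ms
    return transmit[0] + idle[0], transmit[1] + idle[1]


if __name__ == "__main__":
    print(payload_kilobytes(OPUS_BITRATE_BPS, UTTERANCE_S))  # 120.0
    print(wifi_cloud_energy_mj())                            # (20.0, 100.0)
```

The low end pairs the weakest transmit power with the shortest idle wait, and the high end pairs the strongest with the longest, which is how the 20–100 millijoule range falls out.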

Cloud path (LTE): Same audio, but the cellular radio draws 500–2,000 milliwatts active, for a total of 100–400 millijoules.

Cloud path (5G): The 5G modem is a power hog, drawing 1,000–3,000 milliwatts when active. Teardown analyses document this, and Apple’s own battery life disclosures show measurably shorter runtime on 5G than on LTE. Total: 200–600 millijoules.
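The cellular figures give power and total energy but not the radio-active time. Dividing one by the other recovers the window those totals imply. This is an inference from the article's numbers, not a measured value, and the names below are illustrative:

```python
# (power range in mW, total energy range in mJ), as quoted in the text.
CELLULAR_PATHS = {
    "LTE": ((500, 2_000), (100, 400)),
    "5G": ((1_000, 3_000), (200, 600)),
}


def implied_active_seconds(total_mj, power_mw):
    """Radio-active time in seconds implied by energy = power x time."""
    return total_mj / power_mw


def implied_windows(paths):
    """Map each path to the (low, high) active time its figures imply."""
    return {
        name: tuple(implied_active_seconds(e, p) for p, e in zip(power, energy))
        for name, (power, energy) in paths.items()
    }


if __name__ == "__main__":
    for name, window in implied_windows(CELLULAR_PATHS).items():
        print(name, window)
```

Both paths work out to about 0.2 seconds of active radio time at either end of the range, roughly double the 0.1-second WiFi transmit window.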

On-device path: Apple’s A18 Pro Neural Engine runs at approximately 1–2 watts during active inference. For a 30-second utterance, the conformer model does not process all 30 seconds simultaneously; it chunks audio into frames and runs inference in bursts. Realistic total compute time is 2–5 seconds of active ANE utilization. Multiplying power by time gives roughly 2,000–10,000 millijoules; this analysis uses 3,000–10,000 millijoules as its working range.
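The on-device figure is the same power-times-time product. Here is a short sketch of the four corner cases, using the 1–2 W and 2–5 s ranges from the paragraph above (names are illustrative):

```python
from itertools import product

ANE_POWER_W = (1.0, 2.0)     # approximate active inference power, from the text
ACTIVE_TIME_S = (2.0, 5.0)   # realistic active compute time, from the text


def corner_energies_mj(powers_w, times_s):
    """Energy in millijoules for every (power, time) corner: W x s x 1000."""
    return {(p, t): p * t * 1000 for p, t in product(powers_w, times_s)}


if __name__ == "__main__":
    corners = corner_energies_mj(ANE_POWER_W, ACTIVE_TIME_S)
    for (p, t), mj in sorted(corners.items()):
        print(f"{p:.0f} W x {t:.0f} s = {mj:,.0f} mJ")
    print(min(corners.values()), max(corners.values()))
```

The top corner matches the 10,000 mJ ceiling. The bottom corner comes out at 2,000 mJ, somewhat under the 3,000 mJ floor quoted here. The text does not say which operating point that floor assumes.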

Google’s Tensor G4 TPU on the Pixel 9 draws a comparable 1–3 watts during inference, landing in the same range. Moonshine on a bare ARM CPU without a neural accelerator draws under 500 milliwatts but takes longer to process, converging on a similar energy envelope.

The ratio is stark: WiFi cloud is 30–100 times cheaper in energy per utterance. Even 5G, the least efficient wireless path, burns at most a fifth of what local inference costs, because the radio is cheap and the neural engine is expensive.
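One way to reproduce the 30–100x figure from the ranges above is to divide the on-device range by the upper WiFi bound. That pairing is an assumption about how the ratio was formed, since the text does not spell it out. The sketch below also shows the ratio against the lower WiFi bound and the worst-case 5G share:

```python
# Per-utterance energy ranges in millijoules, as quoted in the text.
ON_DEVICE_MJ = (3_000, 10_000)
WIFI_MJ = (20, 100)
FIVE_G_MJ = (200, 600)


def ratio_range(numerator_range, denominator_mj):
    """Divide both ends of a range by a single reference value."""
    return tuple(n / denominator_mj for n in numerator_range)


def headline_ratio():
    """On-device range over the upper WiFi bound (assumed pairing)."""
    return ratio_range(ON_DEVICE_MJ, WIFI_MJ[1])


def ratio_vs_wifi_floor():
    """On-device range over the lower WiFi bound."""
    return ratio_range(ON_DEVICE_MJ, WIFI_MJ[0])


def worst_case_5g_share():
    """Highest 5G cost as a fraction of the lowest on-device cost."""
    return FIVE_G_MJ[1] / ON_DEVICE_MJ[0]


if __name__ == "__main__":
    print(headline_ratio())        # (30.0, 100.0)
    print(ratio_vs_wifi_floor())   # (150.0, 500.0)
    print(worst_case_5g_share())   # 0.2
```

Measured against the 20 mJ best-case WiFi figure, the gap would be wider still, so 30–100x reads as the conservative end of the comparison.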

Why “Local = Green” Is Wrong for ASR

The intuition that local processing saves energy comes from a correct observation in other domains. Reading a cached file from local storage is cheaper than fetching it over the network. Running a simple calculation on the CPU beats a round-trip to a server. But speech recognition is not a simple calculation. It is billions of multiply-accumulate operations through a deep neural network, sustained over multiple seconds, on silicon that was designed to fit inside a phone case with a thermal envelope measured in single-digit watts.

Cloud ASR servers, by contrast, run the same class of model on hardware purpose-built for inference: NVIDIA A100 or H100 GPUs, Google TPU v5 pods, custom ASICs from Deepgram and AssemblyAI. A single GPU can process hundreds of concurrent audio streams. The per-utterance energy allocation on the server is small: a few hundred millijoules, amortized across a rack that would be powered regardless. The expensive part for the user is the radio transmission, and radios are astonishingly efficient at moving small payloads.

This matters for the battery arithmetic: an iPhone 16 Pro battery holds approximately 50,000 joules. At 3,000–10,000 mJ per on-device transcription, you can run roughly 5,000–16,000 utterances before the battery dies from ASR alone. Switch to WiFi cloud and the same battery supports 500,000–2,500,000 utterances. For a device that already struggles to last a full day under normal usage, the energy budget for on-device ASR is not trivial.
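The utterances-per-charge figures follow from dividing the battery capacity by the per-utterance cost. Here is a small sketch using the approximate 50,000 J capacity quoted above (names are illustrative, and the calculation ignores every other drain on the battery):

```python
BATTERY_J = 50_000  # approximate iPhone 16 Pro capacity, from the text


def utterances_per_charge(battery_j, mj_per_utterance):
    """Whole utterances one full charge could fund at a given per-utterance cost."""
    return battery_j * 1000 // mj_per_utterance


if __name__ == "__main__":
    print(utterances_per_charge(BATTERY_J, 10_000))  # 5000     (on-device, high cost)
    print(utterances_per_charge(BATTERY_J, 3_000))   # 16666    (on-device, low cost)
    print(utterances_per_charge(BATTERY_J, 100))     # 500000   (WiFi cloud, high cost)
    print(utterances_per_charge(BATTERY_J, 20))      # 2500000  (WiFi cloud, low cost)
```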

The Exception: Always-On Listening

There is one scenario where on-device wins unambiguously: hotword detection. “Hey Siri,” “OK Google,” and “Alexa” run continuously on dedicated low-power DSP cores drawing microwatts, not milliwatts, listening for a specific acoustic pattern without performing full transcription. The alternative — streaming all ambient audio to a cloud server 24 hours a day — would drain the battery in hours and create a privacy catastrophe that would make a GDPR enforcement officer spontaneously combust. On-device hotword detection is efficient, necessary, and not what this analysis measures.

Full transcription — the kind that converts your 30-second voice message into text, that powers real-time captions, that runs dictation in Notes — is a different beast entirely, and it is that beast that costs 30–100x more to feed locally.

What This Does Not Prove

These energy estimates derive from published SoC thermal design power specifications, WiFi radio power consumption data from IEEE 802.11 standards documentation, and inference duration measurements from open-source ASR benchmarks. They are not direct measurements taken with a power monitor on individual utterances. Apple does not publish per-inference energy consumption for the Neural Engine. Google provides aggregate TPU power data but not per-utterance breakdowns. The 30–100x ratio reflects the order of magnitude from derived estimates, not laboratory calorimetry.

Additionally, the cloud path energy calculation counts only the client-side cost — the energy spent by the phone to send and receive data. Server-side energy, amortized per utterance, is small but nonzero and is not included. A full lifecycle comparison including data center cooling, amortized hardware manufacturing, and network infrastructure energy would narrow the gap somewhat, though not enough to reverse the direction.

Word error rate comparisons across different test sets and normalization schemes are notoriously unreliable. The WER figures cited here come from different benchmarks (Hugging Face Open ASR Leaderboard for open-source models, vendor-published benchmarks for commercial APIs, independent testing for on-device systems) and should be treated as approximate relative positions rather than precise rankings.
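To see why normalization matters, here is a minimal word error rate sketch: word-level edit distance divided by the number of reference words. The normalization rule (lowercase, strip ASCII punctuation) and the sample sentences are illustrative assumptions, not the scheme used by any benchmark cited here:

```python
import string


def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalize_text(text):
    """Illustrative normalization: lowercase and drop ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def wer(reference, hypothesis, normalize=False):
    """Word error rate of a hypothesis against a reference transcript."""
    if normalize:
        reference, hypothesis = normalize_text(reference), normalize_text(hypothesis)
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


if __name__ == "__main__":
    ref = "Send the report by Friday, please."
    hyp = "send the report by friday please"
    print(wer(ref, hyp))                  # 0.5
    print(wer(ref, hyp, normalize=True))  # 0.0
```

The same transcript scores 50% or 0% depending only on how case and punctuation are treated, which is why figures from different benchmarks should be compared loosely.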

The Strongest Case for On-Device Regardless

If GDPR did not exist, if CCPA had never passed, if users had zero concern about their microphone data traversing a network to a server farm in Virginia — on-device ASR would make no engineering sense for intermittent transcription. Cloud models are bigger, more accurate, and 30–100x more energy efficient per utterance on the client. Apple and Google invest billions annually in on-device ASR not because it is better in any measurable technical dimension, but because the alternative is politically and legally untenable.

That is the strongest case for on-device: it exists because it must. Privacy is a feature that costs energy, and in 2026, Apple’s Neural Engine and Google’s Tensor TPU are the silicon that pays the bill. The market has decided that the bill is worth paying. Users overwhelmingly prefer that their voice data stays on their phone, even if they could not articulate the energy cost of that preference. The cost is real, and the preference is also real — engineering exists to serve both, not to pretend one is free.

What You Can Do

If you build mobile apps with speech input: Default to on-device ASR for privacy-sensitive contexts (health, finance, personal notes), but offer cloud ASR as an opt-in for accuracy-critical workflows where users explicitly consent to server processing. The accuracy gap — 5.6% WER (cloud SOTA) versus 8–12% (on-device) — is large enough to matter for medical dictation, legal transcription, and accessibility tools, so disclose the trade-off honestly rather than pretending on-device is both more private and more accurate.
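That recommendation can be expressed as a small routing rule. The following is a hypothetical sketch, not any platform's API: the class names, flags, and disclosure strings are all invented for illustration, and a real app would wire the result into its platform's speech framework.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    ON_DEVICE = "on-device"
    CLOUD = "cloud"


@dataclass(frozen=True)
class TranscriptionRequest:
    accuracy_critical: bool = False  # e.g. medical dictation, legal transcription
    cloud_consent: bool = False      # user explicitly opted in to server processing


def choose_backend(request):
    """Use cloud only when the workflow needs it AND the user has opted in."""
    if request.accuracy_critical and request.cloud_consent:
        return Backend.CLOUD
    return Backend.ON_DEVICE


def disclosure(backend):
    """Plain-language trade-off text to show next to the setting."""
    if backend is Backend.CLOUD:
        return "Audio is sent to a server for higher accuracy."
    return "Audio stays on this device; accuracy may be lower."


if __name__ == "__main__":
    request = TranscriptionRequest(accuracy_critical=True, cloud_consent=False)
    backend = choose_backend(request)
    print(backend.value, "-", disclosure(backend))
```

Because on-device is the fall-through case, a missing or revoked consent flag can never route audio to a server by accident.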

If you design SoC architectures: The 30–100x energy gap between neural inference and radio transmission will narrow as accelerators improve, but it will not close. The physics of moving 120 kilobytes over WiFi will always be cheaper than running 80–150 million parameters through a matrix multiply. The leverage point is model distillation: Moonshine v2 already achieves competitive accuracy at roughly 60 million parameters on bare ARM cores. Smaller, cheaper models beat bigger, hungrier accelerators for edge ASR.

If you set corporate sustainability targets: “We process everything on-device” is not an environmental claim — for ASR workloads, it is the opposite. If your ESG reporting counts on-device inference as zero-carbon because it avoids a data center, your accounting is wrong: count the client-side inference energy, because it is measurable and it is not zero.
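Counting that client-side energy is a unit conversion. Here is a sketch that scales the 3,000–10,000 mJ per-utterance estimate to a fleet total in kilowatt-hours. The utterance volume is a made-up illustration, and converting energy to carbon would need a grid-intensity figure that this article does not supply:

```python
JOULES_PER_KWH = 3_600_000


def fleet_energy_kwh(utterances, mj_per_utterance):
    """Total client-side inference energy in kWh for a number of utterances."""
    joules = utterances * mj_per_utterance / 1000
    return joules / JOULES_PER_KWH


if __name__ == "__main__":
    hypothetical_utterances = 1_000_000  # illustrative volume, not from the article
    low = fleet_energy_kwh(hypothetical_utterances, 3_000)
    high = fleet_energy_kwh(hypothetical_utterances, 10_000)
    print(f"{low:.2f}-{high:.2f} kWh per million utterances")
```

At these estimates, a million on-device transcriptions come to roughly 0.83 to 2.78 kWh of client-side energy, small per user but not zero.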

The Bottom Line

Speech recognition has entered a two-tier reality in 2026. Cloud models led by NVIDIA Canary (5.63% WER) and Whisper Large V3 (7.4% WER across 99 languages) deliver the best accuracy at the lowest client-side energy cost per utterance. On-device models from Apple and Google trade 30–100x more energy and 2–6 percentage points of accuracy for something no cloud model can offer: the guarantee that your voice never leaves your hand. That guarantee is worth billions in annual R&D spending by the two largest mobile platform companies on Earth, not because the engineering favors it, but because privacy law and user trust demand it. The next time your phone transcribes a voice memo without an internet connection, know that the silicon is working at least 30 times harder than it would need to if it were allowed to phone home. It is not allowed. That is the point.