Physical Intelligence Trained Its Robot on 2 Air Fryer Episodes. It Cooked a Sweet Potato. The Valuation: $11 Billion.
Two. That is the number of air fryer episodes in Physical Intelligence's entire training dataset. In one, a different robot pushed the appliance closed. In the other, scraped from an open-source collection, yet another robot placed a plastic bottle inside one on someone's instructions. Neither episode involved cooking, and neither involved a sweet potato.
π0.7 cooked a sweet potato anyway.
On April 16, 2026, the San Francisco-based startup published research showing that its latest robotic foundation model can perform tasks it was never explicitly trained on, combining fragments of learned skills in new configurations to solve novel problems. The researchers have a term for this capability: compositional generalization. If you follow AI, you have seen it before. It is the same property that lets a large language model translate English to French in JSON format despite never seeing that specific combination in its training data. It is also the property that separates a genuine foundation model from a very expensive lookup table.
Nobody had demonstrated it convincingly in robotics until now. And that distinction, if it survives scrutiny, is worth understanding in detail, because it implies that robotic AI may have crossed an inflection point where capabilities begin compounding faster than the underlying data would predict.
What π0.7 Actually Does
Until now, the dominant approach to training robot models has worked like this: collect data on a specific task, train a specialist model on that data, deploy it, then repeat the entire process for the next task. Want a robot that folds laundry? Collect 10,000 laundry-folding episodes, train a model, ship it. Want the same robot to also make coffee? Collect another 10,000 episodes, train a second model, ship that too. The robot does not learn to fold laundry and then figure out that similar motions might work for packing boxes. It memorizes. It does not compose.
π0.7, according to Physical Intelligence's researchers, breaks this pattern. Trained on data from many different robots, human demonstration videos, and autonomous episodes collected by running various policies, the model accepts multimodal prompts that specify not just what the robot should do, but how it should do it: textual task descriptions, visual subgoal images, speed and quality metadata, and control modality labels. This diverse conditioning framework lets it integrate data sources that would conflict under naive merging, because the prompt disambiguates the intended behavior.
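Physical Intelligence has not published an interface for this, but the conditioning idea is easy to sketch. The structure below is a hypothetical illustration in Python: the field names (`instruction`, `subgoal_image`, `speed`, `quality`, `control_mode`) are invented to mirror the categories described in the research, not the company's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class RobotPrompt:
    """Hypothetical multimodal prompt; field names are illustrative,
    not Physical Intelligence's actual interface."""
    instruction: str                            # textual task description
    subgoal_image: Optional[np.ndarray] = None  # target scene image, e.g. HxWx3 uint8
    speed: str = "normal"                       # behavior metadata, e.g. "slow" / "fast"
    quality: str = "standard"                   # e.g. "standard" / "high-precision"
    control_mode: str = "end_effector"          # e.g. "end_effector" / "joint_position"


# Two data sources that would conflict under naive merging (a careful teleoperated
# demo and a fast autonomous rollout of the same task) stop conflicting once the
# metadata labels the intended behavior.
careful_demo = RobotPrompt(
    instruction="place the plastic bottle inside the air fryer",
    speed="slow", quality="high-precision", control_mode="end_effector",
)
fast_rollout = RobotPrompt(
    instruction="place the plastic bottle inside the air fryer",
    speed="fast", quality="standard", control_mode="joint_position",
)
```

The point of the extra fields is disambiguation: without them, a slow careful demonstration and a sloppy fast rollout of the same instruction look like contradictory labels for the same input.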
In testing, π0.7 matched the performance of Physical Intelligence's own fine-tuned specialist models across tasks including coffee-making, laundry folding, and box assembly. That alone is noteworthy, since generalist models in robotics have historically performed 20-40% worse than task-specific systems. But the air fryer result is the one that matters, because it demonstrates synthesis rather than memorization: the model inferred how an appliance works from two tangentially related fragments plus web-based pretraining data, and then, with verbal step-by-step coaching from a human, successfully used it to cook food.
"It's very hard to track down where the knowledge is coming from, or where it will succeed or fail," Lucy Shi, a Physical Intelligence researcher and Stanford computer science Ph.D. student, told TechCrunch.
Running the Data Gap Math
Here is a calculation nobody has published. Large language models like GPT-4 trained on an estimated 10 to 15 trillion text tokens, scraped from a corpus that includes essentially the entire indexed internet, digitized books, scientific papers, and code repositories. The internet generates roughly 2.5 quintillion bytes of new data per day. For language models, the data supply problem is effectively solved: there is more text than any model can consume.
Robots have no equivalent internet to scrape. Open X-Embodiment, the largest publicly available robotics dataset, contains approximately 1 million episodes from 22 different robot embodiments, contributed by 21 research institutions worldwide. Even generously assuming Physical Intelligence's proprietary dataset is ten times larger (the company does not disclose its data size), that yields roughly 10 million episodes.
Compare the two:
| Metric | LLMs (GPT-4 class) | Robotics (π0.7 est.) | Ratio |
|---|---|---|---|
| Training data units | ~13 trillion tokens | ~10 million episodes | 1,300,000:1 |
| Data generation rate | ~500 billion tokens/day (internet) | ~1,000 episodes/day (est. all labs) | 500,000,000:1 |
| Cost per data unit | ~$0 (scraped) | ~$5-50 per episode (teleoperation) | n/a |
| Embodiment diversity | 1 (text) | 22+ robot types | Robots face harder transfer |
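The ratios in that table are easy to sanity-check. A minimal sketch of the arithmetic, using only the estimates cited above (none of these are disclosed figures):

```python
import math

# Back-of-envelope check of the comparison table. Every input is an estimate.
llm_tokens = 13e12          # ~13 trillion training tokens (GPT-4 class estimate)
robot_episodes = 10e6       # ~10 million episodes (generous pi-0.7 estimate)
tokens_per_day = 500e9      # ~500 billion new text tokens per day (internet)
episodes_per_day = 1_000    # ~1,000 new episodes per day across all labs (estimate)

data_ratio = llm_tokens / robot_episodes        # 1,300,000 : 1
rate_ratio = tokens_per_day / episodes_per_day  # 500,000,000 : 1

print(f"training data: {data_ratio:,.0f}:1 (~{math.log10(data_ratio):.1f} orders of magnitude)")
print(f"generation rate: {rate_ratio:,.0f}:1 (~{math.log10(rate_ratio):.1f} orders of magnitude)")
```

Run it and the two ratios come out to roughly 6 and 9 orders of magnitude, which is the gap discussed below.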
At GTC 2026 in March, nearly every robotics company on the show floor was working with the same three-part data stack: synthetic environments, teleoperation recordings, and egocentric (first-person) footage. And nearly every team acknowledged the same bottleneck. Synthetic data is cheap but does not transfer well to unpredictable real-world environments. Teleoperation data is high quality but costs $5 to $50 per episode and does not scale. Egocentric data collection infrastructure, meanwhile, barely exists outside a handful of well-funded labs.
Depending on which row of the table you read, the gap spans six to nine orders of magnitude. Yet π0.7 shows early signs of compositional generalization with training data roughly one millionth the size of what LLMs consumed before exhibiting similar emergent properties. Two possibilities explain this. Either robotics has a fundamentally more favorable data-to-capability scaling curve than language, meaning physical tasks carry more transferable structure per episode than text carries per token, or the generalization claims are narrower than they appear, and the model is interpolating between similar tasks rather than genuinely composing novel behaviors.
$11 Billion for Zero Customers
Physical Intelligence was valued at $2 billion in November 2024 when Jeff Bezos, Thrive Capital, and Lux Capital led a $400 million round. By late 2025, the company had raised $600 million more at a $5.6 billion valuation. In March 2026, Bloomberg reported that the company was in discussions to raise another $1 billion at $11 billion, nearly doubling its valuation in four months.
5.5 times in 16 months, with no product, no customers, and no deployment timeline.
Asked when a system based on these findings might ship, co-founder Sergey Levine, a UC Berkeley professor, declined to speculate. "I think there's good reason to be optimistic, and certainly it's progressing faster than I expected a couple of years ago. But it's very hard for me to answer that question."
For $11 billion to pencil out at a standard 10x forward revenue multiple, Physical Intelligence needs $1.1 billion in annual revenue by roughly 2031. The global industrial robotics market is worth about $55 billion per year and growing at roughly 10% annually, according to the International Federation of Robotics. Service robotics adds another $45 billion. If Physical Intelligence captures 1-2% of that combined $100 billion market within five years, the valuation works. If it becomes the platform layer that every robot manufacturer licenses, it could be conservative.
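The implied math, spelled out under the assumptions in the paragraph above (a standard 10x forward multiple and the IFR market estimates; none of this is company guidance):

```python
# Valuation back-of-envelope using only the assumptions stated in the text.
valuation = 11e9             # reported target valuation
forward_multiple = 10        # standard 10x forward-revenue multiple
required_revenue = valuation / forward_multiple       # $1.1B per year

industrial_market = 55e9     # IFR estimate, industrial robotics per year
service_market = 45e9        # service robotics per year
combined_market = industrial_market + service_market  # ~$100B per year

required_share = required_revenue / combined_market   # ~1.1%
print(f"required revenue: ${required_revenue / 1e9:.1f}B per year")
print(f"implied share of the combined market: {required_share:.1%}")
```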
That is the bet investors are making. Not that Physical Intelligence will sell robots, but that it will sell the brain, the way Nvidia sells the GPU and Qualcomm sells the modem. Lachy Groom, the co-founder who previously backed Figma, Notion, and Ramp as an angel investor, has staked his full-time career on this thesis.
5% to 95% in 30 Minutes of Coaching
Buried in the TechCrunch interview is a detail that deserves more attention than it received. When Physical Intelligence first tested π0.7 on the air fryer task, the success rate was 5%. After researchers spent approximately 30 minutes refining how the task was explained to the model, adjusting the step-by-step verbal instructions, the success rate jumped to 95%.
"Sometimes the failure mode is not on the robot or on the model," Shi said. "It's on us. Not being good at prompt engineering."
If this sounds familiar, it should. Large language models showed the same prompt sensitivity before the field converged on techniques like RLHF and chain-of-thought prompting. GPT-3 in 2020 was wildly inconsistent; ChatGPT in 2022 took a close descendant of that model and wrapped it in RLHF fine-tuning and a far more reliable interaction layer. Physical Intelligence may be at the GPT-3 stage: the capability exists but requires expert-level prompting to unlock reliably.
For deployment, this is a significant open question. In a factory, who writes the prompts? If a robot fails at a task, does the line supervisor spend 30 minutes re-coaching it, or does the factory call Physical Intelligence's support team? Prompt sensitivity works fine in a research lab where every interaction is supervised by a Stanford Ph.D. student. It is less clear how it scales to 10,000 robots on warehouse floors where the operators are not AI researchers.
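Physical Intelligence has not described its coaching interface, so the following is a toy sketch rather than the company's tooling. Operationally, "spending 30 minutes on prompt engineering" is an evaluation loop over instruction phrasings; the rollout call here is simulated to mirror the reported 5% and 95% figures.

```python
import random


def run_episode(coaching: list[str]) -> bool:
    """Toy stand-in for a real robot rollout. Terse coaching succeeds ~5% of
    the time, step-by-step coaching ~95%, mirroring the reported numbers."""
    p_success = 0.05 if len(coaching) == 1 else 0.95
    return random.random() < p_success


def success_rate(coaching: list[str], trials: int = 100) -> float:
    """Empirical success rate of one phrasing over repeated trials."""
    return sum(run_episode(coaching) for _ in range(trials)) / trials


terse = ["cook the sweet potato in the air fryer"]
step_by_step = [
    "open the air fryer drawer",
    "place the sweet potato in the basket",
    "close the drawer",
    "start the cooking cycle",
]

print(f"terse coaching:        {success_rate(terse):.0%}")
print(f"step-by-step coaching: {success_rate(step_by_step):.0%}")
```

Re-coaching, in this framing, is simply running that loop until a phrasing works; the unresolved deployment question is who runs it outside the lab.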
No Standardized Benchmarks Exist
Physical Intelligence measured π0.7 against its own prior specialist models and found equivalent performance. This is the robotics field's dirty secret: there are no standardized benchmarks for manipulation generalization. Open X-Embodiment established shared datasets but not shared evaluation protocols. Every company measures itself against its own prior work, which is like a student grading their own exam and reporting that they outperformed last semester's version of themselves.
Sergey Levine addressed this directly in the TechCrunch interview, noting that "the criticism that can always be leveled at any robotic generalization demo is that the tasks are kind of boring." He pushed back on the framing, arguing that the distinction between an impressive robot demo and one that genuinely generalizes is precisely the point: generalization will always look less dramatic than a choreographed stunt, but it is considerably more useful.
He is right about that. He is also, unavoidably, asking the public to trust his team's self-evaluation of a capability that no independent party can yet verify.
Strongest Counterargument
The strongest case against π0.7 as a genuine inflection point is that we have seen this demo before. Google DeepMind's RT-2 in 2023 demonstrated "emergent" robotic behaviors, including novel object manipulation and rudimentary reasoning, with similarly impressive cherry-picked demonstrations. Three years later, RT-2 has not produced a commercial product. Boston Dynamics, which recently partnered with Google DeepMind to bring foundational intelligence to its Atlas humanoid, is still primarily selling manually programmed industrial solutions to warehouse operators who want predictability, not emergent behavior.
IBM Watson won Jeopardy! in 2011 and never found product-market fit. AlphaFold solved protein structure prediction in 2020 and, while scientifically profound, has not yet yielded a blockbuster drug. The history of AI is littered with capabilities that were genuine but insufficient: real breakthroughs that could not bridge the gap between a controlled demonstration and a deployable system operating reliably at scale, in environments the researchers did not control, under conditions they did not anticipate.
Physical Intelligence has produced a genuine capability. Whether it has produced a product is a question that $11 billion is betting on and that Sergey Levine himself will not answer.
What We Cannot Verify
Physical Intelligence has not disclosed the total size, composition, or compute cost of its training data. The company has not published evaluation protocols that independent researchers could replicate. "Compositional generalization" is evaluated against the company's own prior models, using demonstrations selected by the company's own researchers. The air fryer demo is striking, but the training dataset is opaque: we cannot independently confirm that only two relevant episodes existed, because we do not have access to the dataset.
All robotics valuation comparisons are speculative. No pre-revenue company has successfully sold a general-purpose robot brain at scale. OpenAI reached $3.4 billion in annualized revenue within two years of launching ChatGPT, but language has the internet as a distribution channel. Robots require physical hardware integration, safety certification, and deployment infrastructure that text-based AI does not.
Ashwin Balakrishna, a research scientist at Physical Intelligence, told TechCrunch that the last few months have been "the first time where I'm genuinely surprised" by the model's capabilities. Researcher surprise is not evidence. But it is notable when the surprise comes from people who built the system and know exactly what is in the training data.
The Bottom Line
If π0.7 is the GPT-3 of robotics, the implications compress the timeline for general-purpose robots from "maybe a decade" to "maybe five years." A robot brain that composes skills rather than memorizing tasks could, in principle, be deployed into any factory, kitchen, or warehouse and taught on-site rather than retrained from scratch. That would make Physical Intelligence the most important robotics company since Boston Dynamics, and $11 billion would look cheap in retrospect.
If it is closer to Watson, the demo is real but the product-market gap is wider than the technology can bridge, and the valuation is pricing in a future that the data cannot yet support.
- For robotics engineers and startup founders: the multimodal prompting framework (language + visual subgoals + metadata) is the technical contribution worth studying, regardless of whether π0.7's generalization claims hold up at scale, because it solves the real problem of integrating heterogeneous robot data sources without losing signal in the merge.
- For investors evaluating robotics AI: demand independent benchmarks before writing checks against "emergent" capabilities that the company measured against its own previous models.
- For factory operators and logistics companies weighing robotics adoption: the technology is not ready for deployment today, and Physical Intelligence's own co-founder will not say when it will be. Budget for at least two to three more years of pilot programs and manual integration before anything resembling a general-purpose robot brain ships as a product.
Sergey Levine compared π0.7 to GPT-2's unicorn-in-the-Andes moment: a weird, wonderful capability that nobody expected to emerge from the data. "Seeing that in robotics," he said, "is really special." He is not wrong. What happens next will determine whether it is also worth $11 billion.