We’ve trained a new model, π0.7, that exhibits a step-change in generalization. π0.7 is a general-purpose model that can perform a wide range of dexterous tasks with the same performance as fine-tuned specialists, but even more importantly, it can follow new language commands and perform tasks that were never seen in its training data. In our experiments, we see π0.7 exhibiting the first signs of compositional generalization, recombining skills from various tasks to solve new problems, like using new kitchen appliances and even enabling a new robot to fold laundry for which there is no laundry folding data.
While this kind of generalization has always been thought of as a key strength of robotic foundation models, actual models demonstrated to date have not shown the kind of broad compositional generalization that we’ve seen, for example, from LLMs. LLMs can compose concepts from their training data in new ways: if an LLM knows how to translate English to French, and it knows how to produce JSON output, it can provide translations formatted as JSON. Vision-language-action models can understand diverse semantic concepts, but have not yet been shown to combine skills in new ways, like using a new tool or kitchen appliance. Even for skills that are seen in training, best results are typically obtained by fine-tuning such models for that skill, much like how early language models were fine-tuned for specific problem domains. A true generalist model should perform all of these skills out of the box, and be able to recombine them to solve new tasks. π0.7 demonstrates initial signs of such general capability: it can perform dexterous manipulation skills like those we’ve previously shown with our RL fine-tuned π*0.6 specialist models, with the same speed and robustness; it can compose and recombine the skills it learned to solve new tasks; and it can generalize across robot platforms, scenes, and tasks more effectively than our prior models. The examples below illustrate this breadth of capability, from fine manipulation to long-horizon household behaviors, all with one model, straight out of the box.
What makes π0.7 generalize so broadly? The key to generalization for foundation models is to use broad and diverse data, which in our case includes data from many different robots, human data, and even autonomous episodes collected by running various policies. Merging all these data sources naively does not lead to good results. We find that the key to using all of these data sources to attain compositional generalization is to add diverse context to the prompt: training the model with a variety of multimodal prompt structures that specify not only what the robot should do, but how it should do it. The prompt can include not just a textual description of the task, but a variety of other annotations and modalities. For example, providing the model with a visual subgoal defines a precise spatial layout of objects. Providing the desired length of the episode specifies how quickly the task should be done. Critically, all of these pieces of information disambiguate the behavior, enabling diverse data with different strategies, behaviors, and levels of proficiency to be included in training. At test time, our model accepts standard language instructions, but also information about the desired strategy, and even synthetically generated visual subgoals produced by a lightweight world model. We show some examples of what π0.7 can do below.
The different prompt modalities allow π0.7 to integrate a wide range of diverse data sources, including data from different robots and control modalities, human videos, and autonomous data. While our prior models also used some of these data sources (e.g., videos), π0.7 unifies these under a single prompting framework, supporting:
Diverse language that describes the task and individual sub-steps.
Metadata that describes how the task was performed, such as speed and quality.
Control modality labels that indicate whether to use joint or end-effector control.
Visual subgoal images that show what the end of the current sub-step should look like. These images can be generated at test time by a world model that provides for visual generalization.
With these different annotation sources, π0.7 can leverage more types of data. For example, suboptimal autonomous evaluation data, which would ordinarily risk teaching the model to perform lower-quality actions, can be incorporated by annotating it with appropriate metadata (e.g., lower quality or lower speed).
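To make the idea concrete, here is a minimal sketch of what such a multimodal prompt structure could look like. All class and field names are our own illustrative assumptions, not the actual π0.7 interface; the point is only that language, metadata, control modality, and a visual subgoal can live in one prompt, and that suboptimal data can be tagged honestly rather than discarded.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical prompt container; field names are assumptions for illustration.
@dataclass
class RobotPrompt:
    task: str                              # language description of the task
    substep: Optional[str] = None          # current language sub-step, if any
    quality: Optional[str] = None          # metadata, e.g. "high" or "low"
    speed: Optional[str] = None            # metadata, e.g. "fast" or "slow"
    control_modality: str = "joint"        # "joint" or "end_effector" control
    subgoal_image: Optional[bytes] = None  # visual subgoal, e.g. from a world model

    def annotations(self) -> list[str]:
        """Collect the non-empty annotations that disambiguate the behavior."""
        out = [f"task: {self.task}", f"control: {self.control_modality}"]
        if self.substep:
            out.append(f"substep: {self.substep}")
        if self.quality:
            out.append(f"quality: {self.quality}")
        if self.speed:
            out.append(f"speed: {self.speed}")
        return out

# Suboptimal autonomous data can be folded into training by labeling it as such:
p = RobotPrompt(task="fold the shirt", quality="low", speed="slow")
print(p.annotations())
```

Conditioning on these labels at training time means the model can be steered toward the high-quality, fast behavior at test time simply by prompting for it.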
One of the toughest generalization challenges for robotic foundation models is following user prompts to perform a new task. π0.7 shows early signs of compositional task generalization through a combination of diverse language instructions, language coaching, and visual subgoals. We first observed this emergent ability when we tasked the model to operate a variety of kitchen appliances. We did not collect demonstrations of these specific appliance tasks, and instead tried to prompt the model to operate them. For each appliance, the robot received language coaching for using the appliance: step-by-step language commands similar to those that could guide a person using the appliance for the first time. When we ask the robot to do a new task, using an air fryer to cook a sweet potato, it makes a reasonable attempt, performing part of the task after a few false starts, but not finishing it fully:
π0.7 attempting to use an air fryer with only a zero-shot prompt: "load a sweet potato into the air fryer".
However, if we walk it through the task with step-by-step language coaching, it performs the task much more effectively. This is harder than it seems. It requires understanding the fine-grained instructions and grounding them correctly:
π0.7 using the air fryer with step-by-step verbal coaching.
After we’ve provided language coaching to the robot multiple times, we can use the instructions to fine-tune a high-level policy that can then generate the language subgoals fully autonomously, significantly improving fully autonomous execution of the task without any additional teleoperation at all. The robot has effectively learned the task from language coaching:
π0.7 performing the air fryer task with a fine-tuned high-level policy generating language subtasks. We also visualize the subgoal images produced by our world model for each language subtask. The language subtask and subgoal images are provided to π0.7 to perform the task fully autonomously.
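The hierarchy described above can be sketched as a simple control loop: a high-level policy emits language subtasks, a world model imagines a subgoal image for each one, and the low-level policy executes conditioned on both. This is a hedged sketch under our own assumptions; all function names and signatures below are hypothetical, not the actual π0.7 interfaces.

```python
# Hypothetical hierarchical loop: names and signatures are illustrative only.
def run_episode(high_level_policy, world_model, low_level_policy,
                observe, task, max_steps=10):
    """Execute a long-horizon task autonomously via language subgoals."""
    for _ in range(max_steps):
        obs = observe()
        subtask = high_level_policy(obs, task)       # e.g. "open the air fryer"
        if subtask == "done":
            return True                              # high-level policy declares success
        subgoal_img = world_model(obs, subtask)      # imagined end-of-substep image
        low_level_policy(obs, subtask, subgoal_img)  # act until the sub-step ends
    return False                                     # ran out of steps
```

In this framing, "language coaching" amounts to a human playing the role of `high_level_policy` until enough coached episodes exist to fine-tune a model that fills that role itself.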
We wanted to understand where the robot learned what an air fryer even is. The size and diversity of our training set make it hard to track down the precise episodes that informed this behavior, and the knowledge likely comes from a combination of robot episodes and web-scale vision-language pre-training. After a lot of searching, we found two episodes we collected in a home where a robot closes an air fryer (labeled “push the frying basket into the airfryer” and "put the basket of the airfryer on the leftmost side of the counter"), and data from the open-source DROID dataset on a Franka robot. These episodes look quite different from what the mobile robot actually does in our experiments, suggesting that π0.7 can generalize and compose behaviors to load the sweet potato into the air fryer, much like how an LLM composes different parts of text seen in large-scale datasets from the web:
The closest episodes we found to the air fryer task: two episodes closing air fryers, and data from the open-source DROID dataset with a Franka arm.
With the improved generalization and language-following capabilities of π0.7, we can direct it with language to perform a wide variety of tasks, interactively “teach” new behaviors, and even have a bit of fun with precise and varied language commands!
Interactively directing π0.7 with varied language commands.
π0.7 shows some of the most effective generalization of tasks across embodiments that we’ve seen. One of the most underrepresented embodiments in our training set is the bimanual UR5e system, consisting of two UR5e industrial arms with Robotiq parallel-jaw grippers. This robot is hard to teleoperate: the heavy arms have a lot of inertia, and the grippers are relatively imprecise. We tasked π0.7 to control this robot to fold laundry, even though we did not collect any data of laundry folding with this robot, and to our surprise it could do this consistently. Note that the physical motion of the robot when folding t-shirts differs significantly from the (much smaller) robot that we used to collect t-shirt folding data:
We collected data of laundry folding with the static bimanual robot (left), and then evaluated π0.7 on this task with the bimanual UR5e system (right). No training data was collected for this task with the UR5e bimanual system. Because the robots differ significantly in size, positioning, and morphology, π0.7 has to employ a substantially distinct strategy with the UR5e. Its success rate matches the "zero-shot" success rate of expert teleoperators who have performed the task on the source robot, but attempt it on the UR5e for the first time.
The success rate of π0.7 on this task actually matches the "zero-shot" success rate of expert human teleoperators who had collected the training data for this task on the original robot, when the same teleoperators were asked to perform the task with the bimanual UR5e system. These teleoperators had a mean of 375 hours of teleoperation experience.
Besides broad generalization, we also would like our models to achieve high success rates and to perform tasks quickly. In our recent work, we introduced Recap, an algorithm for training policies with RL to optimize for robustness and throughput. While Recap provided an effective way to optimize policies, with π0.7 we were able to train a single general-purpose model that could perform all of the tasks that we optimized with Recap with the same success rate and (sometimes even higher) throughput, by distilling experience generated during Recap training into the π0.7 model with strategy metadata. The same π0.7 model can perform the laundry folding, espresso making, and box folding tasks at the same or even higher level of performance as the best models trained with Recap:
A quantitative comparison of the single π0.7 model performing each of the tasks introduced in our Recap blog post, in comparison with the RL-trained specialist policy for each task. The y-axis is normalized by the throughput of the specialist. By training on diverse data, including autonomous data from the RL-trained models, the single π0.7 model attains similar or even stronger performance across tasks than the task-specific RL-trained specialists.
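The normalization described above is simply per-task division by the specialist's throughput, so that each specialist sits at 1.0 and the generalist's bars can be read as ratios. A minimal sketch, with made-up placeholder numbers rather than results from the post:

```python
# Per-task throughput normalization: specialist = 1.0 by construction.
def normalized_throughput(generalist, specialist):
    return {task: generalist[task] / specialist[task] for task in generalist}

# Placeholder tasks-per-hour figures, purely illustrative.
specialist = {"laundry": 10.0, "espresso": 8.0, "box_folding": 12.0}
generalist = {"laundry": 10.5, "espresso": 8.0, "box_folding": 13.2}

print(normalized_throughput(generalist, specialist))
```

A value above 1.0 for a task means the single generalist model outperforms that task's dedicated RL-trained specialist on throughput.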
π0.7 is a general-purpose model, in the sense that it can control a wide variety of different robots to perform a wide range of different tasks. Besides the specific controlled experiments that we discuss above that evaluate particular axes of performance and generalization, we also tested a wide range of tasks that include peeling vegetables, cleaning a glass door with Windex, and more. We show a selection of some of the tasks that π0.7 can perform with various robotic platforms below.
π0.7 is a single unified model with emergent compositional generalization, the ability to follow diverse instructions and visual subgoals, and strong out-of-the-box performance even on tasks that previously required fine-tuned specialist models. Powerful and steerable models like π0.7 might make it possible in the future to solve even more complex unseen tasks by having the model “think through” possible ways to perform them, leverage its ability to follow diverse prompts to ground these thoughts in actions, and then reflect on the outcomes to revise the task plan. Effective prompt following and generalization is thus useful not only for allowing people to better specify what they want a robot to do, but also for grounding the semantic reasoning and problem-solving capabilities of modern foundation models, allowing them to translate their semantic generalization capabilities into physical generalization.
If you are excited about working on these problems and would like to join us, get in touch! And if you would like to collaborate with us and apply models like π0.7 to real-world robotics applications, you can contact us at [email protected].