We’ve trained a new model, π0.7, that exhibits a step-change in generalization. π0.7 is a general-purpose model that can perform a wide range of dexterous tasks with the same performance as fine-tuned specialists, but even more importantly, it can follow new language commands and perform tasks that were never seen in its training data. In our experiments, we see π0.7 exhibiting the first signs of compositional generalization, recombining skills from various tasks to solve new problems, like using new kitchen appliances and even enabling a new robot to fold laundry for which there is no laundry folding data.
While this kind of generalization has always been thought of as a key strength of robotic foundation models, actual models demonstrated to date have not shown the kind of broad compositional generalization that we’ve seen, for example, from LLMs. LLMs can compose concepts from their training data in new ways: if an LLM knows how to translate English to French, and it knows how to produce JSON output, it can provide translations formatted as JSON. Vision-language-action models can understand diverse semantic concepts, but have not yet been shown to combine skills in new ways, like using a new tool or kitchen appliance. Even for skills that are seen in training, best results are typically obtained by fine-tuning such models for that skill, much like how early language models were fine-tuned for specific problem domains. A true generalist model should perform all of these skills out of the box, and be able to recombine them to solve new tasks. π0.7 demonstrates initial signs of such general capability: it can perform dexterous manipulation skills like those we’ve previously shown with our RL fine-tuned π*0.6 specialist models, with the same speed and robustness; it can compose and recombine the skills it learned to solve new tasks; and it can generalize across robot platforms, scenes, and tasks more effectively than our prior models. The examples below illustrate this breadth of capability, from fine manipulation to long-horizon household behaviors, all with one model, straight out of the box.
What makes π0.7 generalize so broadly? The key to generalization for foundation models is to use broad and diverse data, which in our case includes data from many different robots, human data, and even autonomous episodes collected by running various policies. Merging all these data sources naively does not lead to good results. We find that the key to using all of these data sources to attain compositional generalization is to add diverse context to the prompt: training the model with a variety of multimodal prompt structures that specify not only what the robot should do, but how it should do it. The prompt can include not just a textual description of the task, but a variety of other annotations and modalities. For example, providing the model with a visual subgoal defines a precise spatial layout of objects. Providing the desired length of the episode specifies how quickly the task should be done. Critically, all of these pieces of information disambiguate the behavior, enabling diverse data with different strategies, behaviors, and levels of proficiency to be included in training. At test time, our model accepts standard language instructions, but also information about the desired strategy, and even synthetically generated visual subgoals produced by a lightweight world model. We show some examples of what π0.7 can do below.
The different prompt modalities allow π0.7 to integrate a wide range of diverse data sources, including data from different robots and control modalities, human videos, and autonomous data. While our prior models also used some of these data sources (e.g., videos), π0.7 unifies these under a single prompting framework, supporting:
Diverse language that describes the task and individual sub-steps.
Metadata that describes how the task was performed, such as speed and quality.
Control modality labels that indicate whether to use joint or end-effector control.
Visual subgoal images that show what the end of the current sub-step should look like. These images can be generated at test time by a world model that provides for visual generalization.
With these different annotation sources, π0.7 can leverage more types of data. For example, suboptimal autonomous evaluation data, which would ordinarily risk teaching the model to perform lower-quality actions, can be incorporated by annotating it with appropriate metadata (e.g., lower quality or lower speed).
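To make the idea concrete, here is a minimal sketch of what such a multimodal prompt structure could look like. All class and field names are our own illustrative assumptions, not the actual π0.7 interface; the point is only that language, metadata, control modality, and a visual subgoal can live in one prompt, and that suboptimal data can be tagged honestly rather than discarded.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical prompt container; field names are assumptions for illustration.
@dataclass
class RobotPrompt:
    task: str                              # language description of the task
    substep: Optional[str] = None          # current language sub-step, if any
    quality: Optional[str] = None          # metadata, e.g. "high" or "low"
    speed: Optional[str] = None            # metadata, e.g. "fast" or "slow"
    control_modality: str = "joint"        # "joint" or "end_effector" control
    subgoal_image: Optional[bytes] = None  # visual subgoal, e.g. from a world model

    def annotations(self) -> list[str]:
        """Collect the non-empty annotations that disambiguate the behavior."""
        out = [f"task: {self.task}", f"control: {self.control_modality}"]
        if self.substep:
            out.append(f"substep: {self.substep}")
        if self.quality:
            out.append(f"quality: {self.quality}")
        if self.speed:
            out.append(f"speed: {self.speed}")
        return out

# Suboptimal autonomous data can be folded into training by labeling it as such:
p = RobotPrompt(task="fold the shirt", quality="low", speed="slow")
print(p.annotations())
```

Conditioning on these labels at training time means the model can be steered toward the high-quality, fast behavior at test time simply by prompting for it.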
One of the toughest generalization challenges for robotic foundation models is following user prompts to perform a new task. π0.7 shows early signs of compositional task generalization through a combination of diverse language instructions, language coaching, and visual subgoals. We first observed this emergent ability when we tasked the model to operate a variety of kitchen appliances. We did not collect demonstrations of these specific appliance tasks, and instead tried to prompt the model to operate them. For each appliance, the robot received language coaching for using the appliance: step-by-step language commands similar to those that could guide a person using the appliance for the first time. When we ask the robot to do a new task, using an air fryer to cook a sweet potato, it makes a reasonable attempt, performing part of the task after a few false starts, but not finishing it fully:
π0.7 attempting to use an air fryer with only a zero-shot prompt: "load a sweet potato into the air fryer".
However, if we walk it through the task with step-by-step language coaching, it performs the task much more effectively. This is harder than it seems. It requires understanding the fine-grained instructions and grounding them correctly:
π0.7 using the air fryer with step-by-step verbal coaching.
After we’ve provided language coaching to the robot multiple times, we can use the instructions to fine-tune a high-level policy that can then generate the language subgoals fully autonomously, significantly improving fully autonomous execution of the task without any additional teleoperation at all. The robot has effectively learned the task from language coaching:
π0.7 performing the air fryer task with a fine-tuned high-level policy generating language subtasks. We also visualize the subgoal images produced by our world model for each language subtask. The language subtask and subgoal images are provided to π0.7 to perform the task fully autonomously.
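The hierarchy described above can be sketched as a simple control loop: a high-level policy emits language subtasks, a world model imagines a subgoal image for each one, and the low-level policy executes conditioned on both. This is a hedged sketch under our own assumptions; all function names and signatures below are hypothetical, not the actual π0.7 interfaces.

```python
# Hypothetical hierarchical loop: names and signatures are illustrative only.
def run_episode(high_level_policy, world_model, low_level_policy,
                observe, task, max_steps=10):
    """Execute a long-horizon task autonomously via language subgoals."""
    for _ in range(max_steps):
        obs = observe()
        subtask = high_level_policy(obs, task)       # e.g. "open the air fryer"
        if subtask == "done":
            return True                              # high-level policy declares success
        subgoal_img = world_model(obs, subtask)      # imagined end-of-substep image
        low_level_policy(obs, subtask, subgoal_img)  # act until the sub-step ends
    return False                                     # ran out of steps
```

In this framing, "language coaching" amounts to a human playing the role of `high_level_policy` until enough coached episodes exist to fine-tune a model that fills that role itself.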
We wanted to understand where the robot learned what an air fryer even is. The size and diversity of our training set make it hard to track down the precise episodes that informed this behavior, and the knowledge likely comes from a combination of robot episodes and web-scale vision-language pre-training. After a lot of searching, we found two episodes we collected in a home where a robot closes an air fryer (labeled “push the frying basket into the airfryer” and "put the basket of the airfryer on the leftmost side of the counter"), and data from the open-source DROID dataset on a Franka robot. These episodes look quite different from what the mobile robot actually does in our experiments, suggesting that π0.7 can generalize and compose behaviors to load the sweet potato into the air fryer, much like how an LLM composes different parts of text seen in large-scale datasets from the web:
The closest episodes we found to the air fryer task: two episodes closing air fryers, and data from the open-source DROID dataset with a Franka arm.
With the improved generalization and language-following capabilities of π0.7, we can direct it with language to perform a wide variety of tasks, interactively “teach” new behaviors, and even have a bit of fun with precise and varied language commands!
Interactively directing π0.7 with varied language commands.
π0.7 shows some of the most effective generalization of tasks across embodiments that we’ve seen. One of the most underrepresented embodiments in our training set is the bimanual UR5e system, consisting of two UR5e industrial arms with Robotiq parallel-jaw grippers. This robot is hard to teleoperate: the heavy arms have a lot of inertia, and the grippers are relatively imprecise. We tasked π0.7 to control this robot to fold laundry, even though we did not collect any data of laundry folding with this robot, and to our surprise it could do this consistently. Note that the physical motion of the robot when folding t-shirts differs significantly from the (much smaller) robot that we used to collect t-shirt folding data:
We collected data of laundry folding with the static bimanual robot (left), and then evaluated π0.7 on this task with the bimanual UR5e system (right). No training data was collected for this task with the UR5e bimanual system. Because the robots differ significantly in size, positioning, and morphology, π0.7 has to employ a substantially distinct strategy with the UR5e. Its success rate matches the "zero-shot" success rate of expert teleoperators who have performed the task on the source robot, but attempt it on the UR5e for the first time.
The success rate of π0.7 on this task actually matches the "zero-shot" success rate of expert human teleoperators who had collected the training data for this task on the original robot, when the same teleoperators were asked to perform the task with the bimanual UR5e system. These teleoperators had a mean of 375 hours of teleoperation experience.
Besides broad generalization, we also would like our models to achieve high success rates and to perform tasks quickly. In our recent work, we introduced Recap, an algorithm for training policies with RL to optimize for robustness and throughput. While Recap provided an effective way to optimize policies, with π0.7 we were able to train a single general-purpose model that could perform all of the tasks that we optimized with Recap with the same success rate and (sometimes even higher) throughput, by distilling experience generated during Recap training into the π0.7 model with strategy metadata. The same π0.7 model can perform the laundry folding, espresso making, and box folding tasks at the same or even higher level of performance as the best models trained with Recap:
A quantitative comparison of the single π0.7 model performing each of the tasks introduced in our Recap blog post, in comparison with the RL-trained specialist policy for each task. The y-axis is normalized by the throughput of the specialist. By training on diverse data, including autonomous data from the RL-trained models, the single π0.7 model attains similar or even stronger performance across tasks than the task-specific RL-trained specialists.
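The normalization described above is simply per-task division by the specialist's throughput, so that each specialist sits at 1.0 and the generalist's bars can be read as ratios. A minimal sketch, with made-up placeholder numbers rather than results from the post:

```python
# Per-task throughput normalization: specialist = 1.0 by construction.
def normalized_throughput(generalist, specialist):
    return {task: generalist[task] / specialist[task] for task in generalist}

# Placeholder tasks-per-hour figures, purely illustrative.
specialist = {"laundry": 10.0, "espresso": 8.0, "box_folding": 12.0}
generalist = {"laundry": 10.5, "espresso": 8.0, "box_folding": 13.2}

print(normalized_throughput(generalist, specialist))
```

A value above 1.0 for a task means the single generalist model outperforms that task's dedicated RL-trained specialist on throughput.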
π0.7 is a general-purpose model, in the sense that it can control a wide variety of different robots to perform a wide range of different tasks. Besides the specific controlled experiments that we discuss above that evaluate particular axes of performance and generalization, we also tested a wide range of tasks that include peeling vegetables, cleaning a glass door with Windex, and more. We show a selection of some of the tasks that π0.7 can perform with various robotic platforms below.
π0.7 is a single unified model with emergent compositional generalization, the ability to follow diverse instructions and visual subgoals, and strong out-of-the-box performance even on tasks that previously required fine-tuned specialist models. Powerful and steerable models like π0.7 might make it possible in the future to solve even more complex unseen tasks by having the model “think through” possible ways to perform them, leverage its ability to follow diverse prompts to ground these thoughts in actions, and then reflect on the outcomes to revise the task plan. Effective prompt following and generalization is thus useful not only for allowing people to better specify what they want a robot to do, but also for grounding the semantic reasoning and problem-solving capabilities of modern foundation models, allowing them to translate their semantic generalization capabilities into physical generalization.
If you are excited about working on these problems and would like to join us, get in touch! And if you would like to collaborate with us and apply models like π0.7 to real-world robotics applications, you can contact us at [email protected].