When a computer defeated the world champion at chess in 1997, it could select the best moves but needed a person to move the pieces. Twenty years later, when AlphaGo defeated the world champion in Go, it still could not move the pieces on its own. Today, LLMs can solve gold-medal IMO problems, but cannot write down the answer with a pencil. This mismatch between how hard something is for us and how hard it is for machines is called Moravec's paradox. Seemingly hard problems like playing chess, solving math problems, or planning routes through congested streets to minimize travel time are "easy" for machines, whereas seemingly easy problems like picking up a chess piece, writing a note, making a peanut butter sandwich, or washing the dishes present exceptionally difficult challenges.
Underscoring this paradox, Benjie Holson proposed a set of "Robot Olympics" challenge tasks in a recent blog post, with seemingly simple everyday behaviors like spreading peanut butter, washing a greasy pan, putting a key in a lock, and turning socks inside-out. These challenge tasks might not seem as cognitively demanding as math olympiad problems, but robotics experts believe they present exceptional challenges for autonomous robots.
We wanted to see how many of these tasks we could tackle just by fine-tuning our latest model, based on π0.6. This is a good test of generalist capability: the tasks were not selected by us, they test a variety of manipulation capabilities, and they have not been demonstrated with previous robotic systems. We were able to demonstrate initial solutions for "gold medal" tasks in 3 out of 5 proposed categories, with "silver medal" solutions for the other 2. The two gold-medal tasks that we did not solve were physically impossible for our robot, though one of them could be solved with a small modification (using a metal tool). This was not a focused research project, and most of the work consisted of collecting data for each task (under 9 hours of data for most tasks).
Benjie Holson's original proposed tasks are separated into categories, with "bronze," "silver," and "gold" tasks within each category. We did not do everything possible to maximize success rates (as discussed, e.g., in our recent work on using RL to optimize reliability and speed), and the policies for these tasks are often inconsistent, though on average they achieve a success rate of 52% and a task progress of 72%. We also ran a baseline that fine-tuned a standard VLM, without using our π0.6 model, to test the importance of robotic foundation model pre-training. This baseline did not succeed on any of the tasks and had an average task progress of 9%, indicating that large-scale robot pre-training is essential for this result.
Whenever possible, we tried to set up the tasks to match the original blog post. For some of the tasks we used a fixed (non-mobile) robot; the original tasks are intended for mobile robots, but we don't expect that a mobile base would make these static tasks any harder.
🥇 Event 1: full body (a.k.a. door). The gold-medal task in this category is to open and go through a self-closing lever-handle door. This is hard because the robot has to hold the door open as it moves through.
🥈 Event 2: laundry. The gold-medal task is to hang an inside-out dress shirt after turning it right-side out, which we do not believe our current robot can physically do, because the gripper is too wide to fit inside the sleeve (something we should fix in the next hardware revision!). We therefore tackled the silver-medal task, which is to turn a sock inside-out. This task is quite difficult due to the shape of the robot's gripper, but our policy was able to learn it with about 8 hours of data.
We also trained a policy for the bronze medal task, folding an inside-out t-shirt.
🥇 Event 3: basic tool use. We tested all three (bronze, silver, gold) tasks in this category. The gold-medal task is to use a key. This is hard because it requires fine manipulation and reorienting the key with the grippers without putting it down. While the original task has a person handing the key to the robot, we had the robot pick it up off the table.
We also ran this task on a mobile robot.
The silver-medal task is to make a peanut butter sandwich. We believe this task is actually harder than the gold-medal task: it requires using a knife to scoop the peanut butter, spreading it with delicate application of force, and then carefully finishing the sandwich.
The bronze-medal task is to clean a window with a spray bottle and paper towels. This is hard because it requires multiple stages, handling deformable paper towels, and operating the spray bottle.
🥈 Event 4: finger tips. The silver-medal task is to use a dog poop bag, which requires putting the bag over the gripper, using it to pick up "dog poop", and then flipping it inside-out. This is extremely hard because the robot must separate the edge of the bag to open it up and then fit it onto the gripper.

The gold-medal task is to peel an orange, which is impossible with our robot's gripper, so we used a tool (technically a rule violation, so we don't count this as successful). Even then the task is extremely difficult because of the need to track which parts have been peeled and to carefully balance forces to avoid damaging the orange.
🥇 Event 5: slippery when wet. The gold-medal task is to clean a greasy pan with water and a sponge, the silver-medal task is to clean peanut butter off the fingers, and the bronze-medal task is to wipe the countertop. These tasks require handling liquids, wet sponges, and peanut butter or grease.
Our (evolutionary) ancestors rarely had to calculate multivariate integrals, but they had to contend with unforgiving physical challenges on a daily basis. Our minds are therefore very well tuned to manipulating objects with our hands and solving many other everyday physical challenges. We immediately notice how hard it is to repurpose our brains to solve math problems, but we hardly break a sweat when we use our brains for exactly the things they evolved for.
Precisely because we are so good at physical interaction, building machines that can interact with the physical world is harder for us than building machines that solve cognitive tasks. We can "explain" to a machine how to perform a task (through a programming language), but this is no more effective than "explaining" a task to a person. Imagine giving someone instructions on how to play the violin or swim like an Olympic athlete: even if you are an expert at such tasks, your "instructions" will hardly serve as more than a starting point. To actually learn these skills, your student will need to practice them for themselves.
Even worse, a robot has no way to ground such instructions, because it lacks even basic physical skills – how to hold a pencil, how to pick up a knife, and how to wipe with a sponge. We can't tell it "to make the sandwich, first pick up the knife," because it doesn't understand how to perform even the most basic building blocks of that skill. These building blocks lie firmly in the realm of physical intelligence, beyond the reach of our self-introspection. We can't program physical intelligence because we don't actually understand it at a conscious level.
Language models are so powerful precisely because they can capture large quantities of knowledge, and then generalize in a compositional manner to apply that knowledge to new problems. But language models by themselves do not solve physical intelligence, because they are trained on human communication (i.e., text from the web), which does not communicate physical skills. We don't post detailed instructions on a web forum about how to move your arm to clean a greasy pan, because everyone already knows it, and because we actually don't know how to convey it. Even the perception abilities of current systems, which have progressed enormously over the past decade, are still largely rooted in explanations, captions, and labels – the kinds of information that people can convey readily in words, and that can be sourced from the web.
The key is to integrate the prior knowledge in multimodal LLMs, which can provide the "theoretical" understanding of physical tasks, with diverse and representative data of real physical behaviors. This is not a place where we can take easy shortcuts – just like it's impossible to learn to see without using images, it's impossible to learn to act in the physical world without sufficient data to ground these interactions. But critically, the aim in creating a foundation model for physical intelligence is not to teach the model every possible behavior that a robot could ever do, but rather a sufficiently rich and diverse basis of behaviors that would provide for meaningful physical understanding, and a grounding of the semantic knowledge captured by multimodal LLMs.
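To make this division of labor concrete, here is a minimal sketch of the vision-language-action idea: a pretrained multimodal backbone supplies the semantic prior, and a small action head trained on robot data grounds it in continuous motor commands. This is an illustrative toy, not the actual π0.6 architecture; the class names, dimensions, and pooling strategy are assumptions chosen for brevity.

```python
# Illustrative toy (NOT the actual π0.6 architecture): a pretrained multimodal
# backbone provides semantic understanding, and a small action head trained on
# robot data grounds it in continuous motor commands.
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_dim: int = 768,
                 action_dim: int = 7, horizon: int = 16):
        super().__init__()
        self.backbone = backbone        # pretrained VLM encoder (semantic prior)
        self.horizon = horizon          # length of the predicted action chunk
        self.action_dim = action_dim    # e.g., end-effector pose delta + gripper
        self.action_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, horizon * action_dim),
        )

    def forward(self, image_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # Fuse camera observations with the language instruction, then decode a
        # short chunk of continuous actions from the pooled representation.
        tokens = torch.cat([image_tokens, text_tokens], dim=1)  # (B, T, D)
        features = self.backbone(tokens)                        # (B, T, D)
        pooled = features.mean(dim=1)                           # (B, D)
        return self.action_head(pooled).view(-1, self.horizon, self.action_dim)

# Example usage with a stand-in backbone (a real system would load pretrained
# VLM weights here):
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True),
    num_layers=2,
)
policy = ToyVLAPolicy(backbone)
actions = policy(torch.randn(1, 256, 768), torch.randn(1, 32, 768))  # (1, 16, 7)
```

Real systems replace the stand-in backbone with pretrained VLM weights and use far more expressive action decoders; the point of the sketch is only the split between semantic knowledge from the backbone and physical grounding from robot data.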
The bitter-sweet lesson from the machine learning revolution is that many of the things that we struggle to program into computers directly can be learned from data, but only when that data is available. Moravec's paradox can then be seen as a statement about the challenges of data sparsity: if we can't learn what we need from data on the web, and we are forced to program it in, we will not get good performance. If we can get large amounts of data for a particular skill, we should be able to learn it reliably, but this is not enough either – we don't want to require huge amounts of data for every single task that the robot needs to perform.
Vision-language-action models like π0.6 provide a way to capture general physical knowledge from a highly diverse repertoire of tasks, providing a powerful foundation from which to learn downstream skills with much smaller and more practical datasets. That is why we were able to solve these tasks by fine-tuning our latest robotic foundation model, while the baseline model that did not use large-scale robot pre-training could not solve any of them. As our models become more powerful, it will become easier to learn even the most complex tasks. New tasks might not only require less data, but could use simpler data sources, as we discuss in our recent post on the emergence of human-robot transfer, or even leverage autonomous experience via reinforcement learning. Over time, the bottleneck will shift upward: as we address low-level skills in a general and robust way, we'll be able to improve our policies further through higher-level training (we already observed early signs of this in the verbal instruction training protocol in the original π0.5 research paper). As this happens, we'll finally be able to build truly general models that combine physical understanding and cognition, and perhaps understand the world in a way that is not too different from that of our own brains.
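For a concrete picture of what "fine-tuning on a small task dataset" means mechanically, here is a hedged sketch built on the ToyVLAPolicy toy above. The behavior-cloning loss, optimizer, and hyperparameters are illustrative assumptions, not our actual training recipe.

```python
# Illustrative fine-tuning loop (an assumption for this sketch, not our actual
# recipe): behavior cloning on a small per-task demonstration set, starting
# from pretrained policy weights.
import torch
import torch.nn.functional as F

def finetune(policy: torch.nn.Module, demos, epochs: int = 10, lr: float = 1e-5):
    """`demos` yields (image_tokens, text_tokens, expert_actions) batches drawn
    from a few hours of teleoperated demonstrations of the target task."""
    opt = torch.optim.AdamW(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for image_tokens, text_tokens, expert_actions in demos:
            pred = policy(image_tokens, text_tokens)  # (B, horizon, action_dim)
            loss = F.mse_loss(pred, expert_actions)   # imitate the demonstrations
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy
```

In a loop like this, the VLM baseline discussed above differs only in its starting weights: without large-scale robot pre-training, the same fine-tuning procedure failed on every task.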
If you are excited about these ideas and would like to join our team, then get in touch!