1/7
What if we could show a robot how to do a task?
We present Vid2Robot, a robot policy trained to decode human intent from visual cues and translate it into actions in its own environment.
Website: Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers
arXiv: https://arxiv.org/abs/2403.12943 (Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers)
2/7
The prompt video shows a human moving a green jalapeno chip bag near a Coke can. Note that the table and camera viewpoint differ from the robot's current setup. The policy must both identify the task and figure out how to perform it in its current environment.
3/7
When using human videos as prompts, Vid2Robot outperforms BC-Z, even when both are trained on the same data.
4/7
We investigate two emergent capabilities of video-conditioned policies:
(i) Can the motion shown in a prompt video be performed on a different object in the robot's view?
(ii) Can prompt videos be chained to solve longer-horizon tasks?
5/7
For Cross-Object Motion Transfer, Vid2Robot can apply the motion demonstrated on one object to other objects, and in many cases it does so far better than BC-Z.
6/7
Vid2Robot can also solve some long-horizon tasks by chaining a prompt video for each step.
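A minimal sketch of what we mean by chaining (the helper callables and the sub-task success check are hypothetical, not the paper's API): run the policy to completion on one sub-task's prompt video, then move on to the next.

```python
# Illustrative chaining loop; `policy`, `get_obs`, `apply_action`, and
# `is_subtask_done` are hypothetical callables supplied by the caller.
def chain_prompt_videos(policy, prompt_videos, get_obs, apply_action,
                        is_subtask_done, max_steps_per_subtask=100):
    """Run a long-horizon task as a sequence of video-prompted sub-tasks."""
    for video in prompt_videos:                 # one prompt video per step of the task
        for _ in range(max_steps_per_subtask):
            action = policy(get_obs(), video)   # video-conditioned action prediction
            apply_action(action)
            if is_subtask_done():               # e.g., a heuristic or human success check
                break
```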
7/7
How do we train this model?
Vid2Robot uses cross-attention to fuse prompt-video features with the robot's current state, and decodes the fused representation into actions that carry out the demonstrated task.
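A hedged sketch of that idea (illustrative only; the encoders, token counts, dimensions, and action head below are assumptions, not the paper's exact architecture): tokens from the robot's current observation act as queries, tokens from the human prompt video act as keys/values, and the fused output is decoded into an action.

```python
# Minimal PyTorch sketch of video-conditioned cross-attention.
import torch
import torch.nn as nn

class Vid2RobotSketch(nn.Module):
    def __init__(self, dim=512, num_heads=8, action_dim=7):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.action_head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, action_dim))

    def forward(self, robot_state_tokens, prompt_video_tokens):
        # Queries come from the robot's current observation;
        # keys/values come from the human prompt video.
        fused, _ = self.cross_attn(query=robot_state_tokens,
                                   key=prompt_video_tokens,
                                   value=prompt_video_tokens)
        # Pool the fused tokens and decode an action
        # (e.g., an end-effector delta plus gripper command).
        return self.action_head(fused.mean(dim=1))

# Example: 16 robot-state tokens, 64 prompt-video tokens, 512-dim embeddings.
policy = Vid2RobotSketch()
robot_tokens = torch.randn(1, 16, 512)
video_tokens = torch.randn(1, 64, 512)
action = policy(robot_tokens, video_tokens)   # shape: (1, 7)
```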