bnew

Veteran
Joined
Nov 1, 2015
Messages
55,625
Reputation
8,224
Daps
157,094

Google DeepMind Announces LLM-Based Robot Controller RT-2

OCT 17, 2023 3 MIN READ

by Anthony Alford, Senior Director, Development at Genesys Cloud Services


Google DeepMind recently announced Robotics Transformer 2 (RT-2), a vision-language-action (VLA) AI model for controlling robots. RT-2 uses a fine-tuned LLM to output motion control commands. It can perform tasks not explicitly included in its training data and improves on baseline models by up to 3x on emergent skill evaluations.

DeepMind trained two variants of RT-2, using two different underlying visual-LLM foundation models: a 12B parameter version based on PaLM-E and a 55B parameter one based on PaLI-X. The LLM is co-fine-tuned on a mix of general vision-language datasets and robot-specific data. The model learns to output a vector of robot motion commands, which is treated as simply a string of integers: in effect, it is a new language that the model learns. The final model is able to accept an image of the robot's workspace and a user command such as "pick up the bag about to fall off the table," and from that generate motion commands to perform the task. According to DeepMind,
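
The "new language" of motion commands can be pictured with a small sketch: each dimension of a continuous robot command is discretized into one of 256 bins, and the bin indices are emitted as a plain string of integers that the LLM can generate like any other text. The 256-bin scheme matches the RT-2 paper's description; the function names, value ranges, and 7-DoF command layout below are illustrative, not DeepMind's released code.

```python
import numpy as np

N_BINS = 256  # RT-2 discretizes each action dimension into 256 bins

def tokenize_action(action, low=-1.0, high=1.0):
    """Map a continuous action vector to a string of integer tokens."""
    clipped = np.clip(action, low, high)
    bins = np.round((clipped - low) / (high - low) * (N_BINS - 1)).astype(int)
    return " ".join(str(b) for b in bins)

def detokenize_action(token_str, low=-1.0, high=1.0):
    """Invert the mapping: integer tokens back to continuous commands."""
    bins = np.array([int(t) for t in token_str.split()])
    return bins / (N_BINS - 1) * (high - low) + low

# An illustrative 7-DoF command: xyz position delta, rotation delta,
# and a gripper open/close flag, each normalized to [-1, 1].
cmd = np.array([0.1, -0.2, 0.05, 0.0, 0.3, -0.1, 1.0])
tokens = tokenize_action(cmd)        # e.g. "140 102 ..." -- text the LLM emits
recovered = detokenize_action(tokens)
```

The round trip loses at most half a bin width of precision per dimension, which is the usual trade-off of treating control as discrete token prediction.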

Not only does RT-2 show how advances in AI are cascading rapidly into robotics, it shows enormous promise for more general-purpose robots. While there is still a tremendous amount of work to be done to enable helpful robots in human-centered environments, RT-2 shows us an exciting future for robotics just within grasp.

Google Robotics and DeepMind have published several systems that use LLMs for robot control. In 2022, InfoQ covered Google's SayCan, which uses an LLM to generate a high-level action plan for a robot, and Code-as-Policies, which uses an LLM to generate Python code for executing robot control. Both of these use a text-only LLM to process user input, with the vision component handled by separate robot modules. Earlier this year, InfoQ covered Google's PaLM-E which handles multimodal input data from robotic sensors and outputs a series of high-level action steps.

RT-2 builds on a previous implementation, RT-1. The key idea of the RT series is to train a model to directly output robot commands, in contrast to previous efforts which output higher-level abstractions of motion. Both RT-2 and RT-1 accept as input an image and a text description of a task. However, while RT-1 used a pipeline of distinct vision modules to generate visual tokens to input to an LLM, RT-2 uses a single vision-language model such as PaLM-E.
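
The architectural difference can be sketched schematically. The model calls below are stand-in stubs (the real models are not publicly runnable), so only the data flow is the point: RT-1 tokenizes vision separately before its control transformer, while RT-2 hands both modalities to one vision-language model.

```python
def vision_module(image):
    """Stand-in for RT-1's separate vision tokenizer pipeline."""
    return [f"vt{i}" for i in range(8)]  # dummy visual tokens

def control_transformer(visual_tokens, instruction):
    """Stand-in for RT-1's transformer over visual + text tokens."""
    return "140 102 134 127 165 114 255"  # dummy action-token string

def vlm_generate(image, instruction):
    """Stand-in for RT-2's single VLM (e.g. PaLI-X or PaLM-E based)."""
    return "140 102 134 127 165 114 255"

def rt1_style(image, instruction):
    # RT-1: distinct vision modules produce visual tokens first,
    # which then go to a transformer trained to emit action tokens.
    return control_transformer(vision_module(image), instruction)

def rt2_style(image, instruction):
    # RT-2: one co-fine-tuned vision-language model consumes both
    # modalities end to end and emits action tokens directly.
    return vlm_generate(image, instruction)
```

Both paths end in the same integer-string action representation; what changes is where vision enters the model.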

DeepMind evaluated RT-2 on over 6,000 trials. In particular, the researchers were interested in its emergent capabilities: that is, its ability to perform tasks not present in the robot-specific training data but that emerge from its vision-language pre-training. The team tested RT-2 on three task categories: symbol understanding, reasoning, and human recognition. Compared with baselines, RT-2 achieved "more than 3x average success rate" of the best baseline. However, the model did not acquire any physical skills that were not included in the robot training data.

In a Hacker News discussion about the work, one user commented:

It does seem like this work (and a lot of robot learning works) are still stuck on position/velocity control and not impedance control. Which is essentially output where to go, either closed-loop with a controller or open-loop with a motion planner. This seems to dramatically lower the data requirement but it feels like a fundamental limit to what task we can accomplish. The reason robot manipulation is hard is because we need to take into account not just what's happening in the world but also how our interaction alters it and how we need to react to that.

Although RT-2 has not been open sourced, the code and data for RT-1 have been.
 

bnew


[Submitted on 9 Jan 2024]

Large Language Models for Robotics: Opportunities, Challenges, and Perspectives

Jiaqi Wang, Zihao Wu, Yiwei Li, Hanqi Jiang, Peng Shu, Enze Shi, Huawen Hu, Chong Ma, Yiheng Liu, Xuhui Wang, Yincheng Yao, Xuan Liu, Huaqin Zhao, Zhengliang Liu, Haixing Dai, Lin Zhao, Bao Ge, Xiang Li, Tianming Liu, Shu Zhang
Large language models (LLMs) have undergone significant expansion and have been increasingly integrated across various domains. Notably, in the realm of robot task planning, LLMs harness their advanced reasoning and language comprehension capabilities to formulate precise and efficient action plans based on natural language instructions. However, for embodied tasks, where robots interact with complex environments, text-only LLMs often face challenges due to a lack of compatibility with robotic visual perception. This study provides a comprehensive overview of the emerging integration of LLMs and multimodal LLMs into various robotic tasks. Additionally, we propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions. Our results, based on diverse datasets, indicate that GPT-4V effectively enhances robot performance in embodied tasks. This extensive survey and evaluation of LLMs and multimodal LLMs across a variety of robotic tasks enriches the understanding of LLM-centric embodied intelligence and provides forward-looking insights toward bridging the gap in Human-Robot-Environment interaction.
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Cite as: arXiv:2401.04334 [cs.RO]
(or arXiv:2401.04334v1 [cs.RO] for this version)


Submission history


From: Yiwei Li

[v1] Tue, 9 Jan 2024 03:22:16 UTC (5,034 KB)

 