Google's PaLM-E is a generalist robot brain that takes commands

Benj Edwards

Biz & IT

Mar 2023

On Monday, a group of AI researchers from Google and the Technical University of Berlin unveiled PaLM-E, a multimodal embodied visual-language model (VLM) with 562 billion parameters that integrates vision and language for robotic control. They claim it is the largest VLM ever developed and that it can perform a variety of tasks without the need for retraining.

It's also resilient and can react to its environment. For example, the PaLM-E model can guide a robot to get a chip bag from a kitchen, and with PaLM-E integrated into the control loop, the robot becomes resistant to interruptions that might occur during the task. In a video example, a researcher grabs the chips from the robot and moves them, but the robot locates the chips and grabs them again.

In another example, the same PaLM-E model autonomously controls a robot through tasks with complex sequences that previously required human guidance. Google's research paper explains how PaLM-E turns instructions into actions:

We demonstrate the performance of PaLM-E on challenging and diverse mobile manipulation tasks. We largely follow the setup in Ahn et al. (2022), where the robot needs to plan a sequence of navigation and manipulation actions based on an instruction by a human. For example, given the instruction "I spilled my drink, can you bring me something to clean it up?", the robot needs to plan a sequence containing "1. Find a sponge, 2. Pick up the sponge, 3. Bring it to the user, 4. Put down the sponge." Inspired by these tasks, we develop 3 use cases to test the embodied reasoning abilities of PaLM-E: affordance prediction, failure detection, and long-horizon planning. The low-level policies are from RT-1 (Brohan et al., 2022), a transformer model that takes RGB image and natural language instruction, and outputs end-effector control commands.
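The plan and the failure-detection use case in the excerpt above can be pictured as a simple closed loop: run a step, check whether it succeeded, and try again if it did not. The Python sketch below is only an illustration under loud assumptions. `FlakyRobot`, `run_plan`, and `max_attempts` are hypothetical names, the "robot" is a stub that fails a chosen step once (standing in for someone moving the object), and a plain retry stands in for whatever replanning PaLM-E actually performs. It is not Google's implementation.

```python
class FlakyRobot:
    """Hypothetical stand-in for a low-level policy; fails chosen steps once."""

    def __init__(self, fail_once):
        self.pending_failures = set(fail_once)

    def execute(self, step):
        # Simulate an interruption: the first attempt at a flagged step fails.
        if step in self.pending_failures:
            self.pending_failures.remove(step)
            return False
        return True


def run_plan(steps, execute, max_attempts=3):
    """Run steps in order, retrying a step when failure is detected."""
    log = []
    for step in steps:
        for attempt in range(1, max_attempts + 1):
            succeeded = execute(step)
            log.append((step, attempt, succeeded))
            if succeeded:
                break
        else:
            # Every attempt failed, so give up on the whole plan.
            raise RuntimeError(f"step failed after {max_attempts} attempts: {step}")
    return log


# The four steps quoted from the paper's example instruction.
plan = [
    "Find a sponge",
    "Pick up the sponge",
    "Bring it to the user",
    "Put down the sponge",
]
robot = FlakyRobot(fail_once={"Pick up the sponge"})
log = run_plan(plan, robot.execute)
for step, attempt, succeeded in log:
    print(f"{step} (attempt {attempt}): {'ok' if succeeded else 'failed'}")
```

In this toy run the pick-up step fails once and succeeds on the second attempt, which loosely mirrors the chip-bag video: the interruption is detected and the task continues rather than aborting.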

PaLM-E is a next-token predictor, and it's called "PaLM-E" because it's based on Google's existing large language model (LLM) called "PaLM" (which is similar to the technology behind ChatGPT). Google has made PaLM "embodied" by adding sensory information and robotic control.

Since it's based on a language model, PaLM-E takes continuous observations, like images or sensor data, and encodes them into a sequence of vectors with the same dimensions as the embeddings of language tokens. This allows the model to "understand" the sensory information in the same way it processes language.
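To make that encoding step concrete, here is a minimal Python sketch of the general idea: observation vectors are passed through a linear projection so they come out the same width as the text-token vectors, and both kinds of vectors are then placed in one sequence. Everything specific here is an assumption for illustration. The dimensions (`EMBED_DIM`, `OBS_DIM`), the random weights standing in for learned ones, the toy word-level "tokenizer", and the function names are all hypothetical, and the real model's encoders are far more elaborate.

```python
import random

EMBED_DIM = 8  # hypothetical token-embedding width; real models use thousands
OBS_DIM = 4    # hypothetical size of one raw observation feature vector


def make_projection(in_dim, out_dim, seed=0):
    """Random stand-in for a learned linear map into token-embedding space."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(features, weights):
    """Map one observation vector to a vector of the token-embedding width."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def embed_text(text, table, rng):
    """Look up (or lazily create) a toy embedding per whitespace-separated word."""
    vectors = []
    for word in text.split():
        if word not in table:
            table[word] = [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
        vectors.append(table[word])
    return vectors


def build_input_sequence(segments, weights, seed=1):
    """Place text embeddings and projected observations in one sequence, in order."""
    rng = random.Random(seed)
    table = {}
    sequence = []
    for kind, value in segments:
        if kind == "text":
            sequence.extend(embed_text(value, table, rng))
        elif kind == "obs":
            sequence.extend(project(vec, weights) for vec in value)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return sequence


weights = make_projection(OBS_DIM, EMBED_DIM)
prompt = [
    ("text", "What is in"),
    ("obs", [[0.2, 0.5, 0.1, 0.9], [0.7, 0.3, 0.4, 0.0]]),  # e.g. two image-patch features
    ("text", "?"),
]
sequence = build_input_sequence(prompt, weights)
print(len(sequence), len(sequence[0]))  # prints: 6 8
```

The point of the sketch is only that, after projection, the downstream model sees one uniform list of same-width vectors and does not need a separate input path for sensor data.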

In addition to the RT-1 robotics transformer, PaLM-E draws from Google's previous work on ViT-22B, a vision transformer model revealed in February. ViT-22B has been trained on various visual tasks, such as image classification, object detection, semantic segmentation, and image captioning.

Google Robotics isn't the only research group working on robotic control with neural networks. This particular work resembles Microsoft's recent "ChatGPT for Robotics" paper, which experimented with combining visual data and large language models for robotic control in a similar way.
