Google Co-releases SayCan Model for Bots to Give Sensible Answers
- Sep 26, 2022
What makes large language models (LLMs) so capable is the sheer volume of information they absorb during training from a large corpus of text extracted from the web.
Given how well LLMs understand language, can a robot that uses one directly communicate with humans and carry out tasks equally well?
The answer is no. An LLM is not grounded in the physical world: it neither observes its physical surroundings nor sees the effect of its actions on them. As a result, some of the answers an LLM gives are impractical or incompatible with the environment the robot is actually in.
Figure | Responses of different large language models and the new SayCan model (right) to the same user request (Source: arXiv)
In the figure above, a human asks a kitchen robot, which can only perform basic operations such as "pick up a kitchen utensil" and "move to a location", for help: "I spilled my drink, can you help?"
Three well-known large language models gave answers that did not fit the scenario: GPT-3 responded with "You need a vacuum cleaner", LaMDA responded with "Can I help you find a vacuum cleaner?", and FLAN replied, "Sorry, I didn't mean to spill my drink".
As you can see, none of the LLMs gave the robot a response it could act on, because the responses were not grounded in the surrounding environment.
To bring language-driven systems such as robots more in tune with their physical surroundings, and thus make them more effective at helping humans, Google's robotics team, together with Everyday Robots, has developed a new language processing model called SayCan.
The model not only learns to understand language commands and give answers, but also evaluates how likely each answer is to succeed in the current physical environment, so that the robot can "do what it says".
The work was recently published on arXiv in a paper entitled "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances".
In brief, SayCan uses a large language model to solve tasks that are grounded in a physical environment, and it consists of two main components.
First, the large language model in the Say part interprets the instruction and proposes appropriate answers that would help solve the problem.
Then, the Can part evaluates those answers using "affordance functions", which determine which behaviors are feasible to perform right now in the current physical environment.
For the Can part, the researchers use reinforcement learning (RL) to train language-conditioned value functions that estimate the feasibility of each behavior in the current environment.
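The interplay of the two parts can be illustrated with a small sketch. It assumes that the two scores are combined by multiplication and that the skill with the highest product is chosen; the function name `pick_skill`, the skill names, and all numbers are made up for illustration and do not come from the paper.

```python
def pick_skill(say_scores, can_scores):
    """Return the skill with the highest combined (say * can) score.

    say_scores: how useful the language model rates each skill for the request
    can_scores: how likely each skill is to succeed in the current state
    """
    combined = {
        skill: say_scores[skill] * can_scores.get(skill, 0.0)
        for skill in say_scores
    }
    best = max(combined, key=combined.get)
    return best, combined


# Hypothetical scores for the request "I spilled my drink, can you help?"
say_scores = {
    "find a vacuum cleaner": 0.50,  # sounds plausible as language
    "find a rag": 0.40,
    "go to the table": 0.10,
}
can_scores = {
    "find a vacuum cleaner": 0.05,  # assumed: no vacuum cleaner in this kitchen
    "find a rag": 0.90,
    "go to the table": 0.95,
}

best, combined = pick_skill(say_scores, can_scores)
print(best)
```

In this toy setup the language model alone would favor the vacuum cleaner, but its low feasibility score pushes the combined choice toward the rag.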
Specifically, SayCan formalizes the problem as follows. The system first receives a natural language instruction from the user that describes the task the robot should perform; this instruction can be long, abstract, or even ambiguous.
The system is also given a predefined set of skills Π that the robot has, where each skill π ∈ Π is a short, decomposed task, such as picking up a particular object. Each skill has its own short language description l_π, e.g., "find a knife and fork", and its own affordance function p(c_π | s, l_π).
In plain terms, the affordance function p(c_π | s, l_π) is the probability that the skill π with description l_π is completed successfully when started from state s, where c_π is a Bernoulli random variable. In RL terms, p(c_π | s, l_π) is also the value function of the skill when the reward is set to 1 if the skill completes successfully and 0 otherwise.
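The link between the value function and the success probability can be written out in one line. This is a short sketch that assumes an undiscounted episode with a single terminal reward R of 1 on success and 0 on failure; the symbol V for the value function is introduced here for illustration.

```latex
V^{\pi}(s)
  = \mathbb{E}\left[ R \mid s, \ell_\pi \right]
  = 1 \cdot p(c_\pi = 1 \mid s, \ell_\pi) + 0 \cdot p(c_\pi = 0 \mid s, \ell_\pi)
  = p(c_\pi = 1 \mid s, \ell_\pi)
```

Under that assumption, the expected return of a skill is exactly its probability of success.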
The figure below shows the SayCan algorithm and the idea behind it.
Figure | Algorithm of SayCan model (Source: arXiv)
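As a rough illustration of the planning loop, the following sketch repeatedly picks the skill with the highest product of language score and feasibility score, executes it, and stops when a terminating skill is chosen. It is a simplified reading, not the authors' implementation: the function `saycan_plan`, its parameters, the `"done"` terminator, and the toy scoring functions are all assumptions made so the example runs on its own.

```python
from typing import Callable, List


def saycan_plan(
    instruction: str,
    skills: List[str],
    llm_score: Callable[[str, List[str], str], float],
    affordance: Callable[[str, str], float],
    get_state: Callable[[], str],
    execute: Callable[[str], None],
    max_steps: int = 10,
) -> List[str]:
    """Greedy loop: at each step choose the skill maximizing llm_score * affordance."""
    plan: List[str] = []
    for _ in range(max_steps):
        state = get_state()
        best = max(
            skills,
            key=lambda sk: llm_score(instruction, plan, sk) * affordance(state, sk),
        )
        plan.append(best)
        if best == "done":  # hypothetical terminating skill
            break
        execute(best)
    return plan


# Toy stand-ins (not real models) so the sketch runs end to end.
ORDER = ["find a rag", "pick up the rag", "bring it to the user", "done"]
executed: List[str] = []


def toy_llm_score(instruction: str, plan: List[str], skill: str) -> float:
    # Pretend the language model prefers the next step of a fixed script.
    next_step = ORDER[min(len(plan), len(ORDER) - 1)]
    return 0.9 if skill == next_step else 0.1


def toy_affordance(state: str, skill: str) -> float:
    if skill == "find a vacuum cleaner":
        return 0.0  # assumed infeasible in this kitchen
    return 1.0 if skill == "done" else 0.8


plan = saycan_plan(
    "I spilled my drink, can you help?",
    ORDER + ["find a vacuum cleaner"],
    toy_llm_score,
    toy_affordance,
    get_state=lambda: "kitchen",
    execute=executed.append,
)
print(plan)
```

Feeding the steps chosen so far back into the language score is what lets a sequence of short skills add up to a long task.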
To validate SayCan's performance, the researchers used two main evaluation metrics. The first is the planning success rate, which measures whether the sequence of skills chosen by the model fits the instruction; whether those skills were actually carried out successfully is not considered here.
The second is the execution success rate, which measures whether the system actually executes and completes the task required by the instruction.
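The two metrics can be computed as simple fractions over a set of trials. The sketch below assumes each trial has been judged on both criteria; the `TrialResult` type, the `success_rates` function, and the four example trials are hypothetical and are not data from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TrialResult:
    plan_correct: bool  # did the chosen skill sequence fit the instruction?
    executed_ok: bool   # did the robot actually complete the task?


def success_rates(trials: List[TrialResult]) -> Tuple[float, float]:
    """Return (planning success rate, execution success rate)."""
    if not trials:
        raise ValueError("at least one trial is required")
    n = len(trials)
    planning = sum(t.plan_correct for t in trials) / n
    execution = sum(t.executed_ok for t in trials) / n
    return planning, execution


# Made-up trial outcomes for illustration only.
trials = [
    TrialResult(plan_correct=True, executed_ok=True),
    TrialResult(plan_correct=True, executed_ok=False),  # good plan, failed on the robot
    TrialResult(plan_correct=False, executed_ok=False),
    TrialResult(plan_correct=True, executed_ok=True),
]

planning, execution = success_rates(trials)
print(f"planning: {planning:.0%}, execution: {execution:.0%}")
```

Keeping the two rates separate shows whether a failure came from the plan itself or from carrying it out on the robot.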
Figure | Evaluation results (Source: arXiv)
The researchers evaluated the model on 101 tasks. In the simulated kitchen, SayCan achieved an 84 percent planning success rate and a 74 percent execution success rate. In the evaluation conducted in a real kitchen environment, the planning success rate dropped by 3 percentage points to 81 percent, and the execution success rate dropped by 14 percentage points to 60 percent.
Figure | Examples of SayCan performing other tasks (Source: arXiv)
Returning to the earlier example: when faced with the user's request "I spilled my drink, can you help?", SayCan, unlike the other language models, responds with the plan "1. find a rag, 2. pick up the rag, 3. bring it to the user, 4. finish". This lets the robot help the user in a way the other models cannot.