A study found that it is still too early to apply the latest artificial intelligence (AI) large language models (LLMs) such as Claude Opus 4.1, GPT-5 and Gemini 2.5 to general-purpose robots.
TechCrunch reported on the 1st (local time) that Andon Labs, a U.S. AI safety evaluation company, loaded cutting-edge LLMs onto a robot vacuum, including OpenAI's GPT-5, Google's Gemini 2.5, Anthropic's Claude Opus 4.1, xAI's Grok and Meta's Llama, and assigned it the simple task of delivering butter. All of the models showed a completion rate of 40% or less.
The researchers ran each model through five trials of a six-step task: leaving the charger and navigating to the kitchen to find the box, distinguishing the butter inside the box, recognizing that no user was at the delivery spot, confirming that the user took the butter before returning to the charger, breaking a long route into short segments, and completing everything within 15 minutes.
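The protocol above can be sketched as a simple scoring loop. Note this is a hypothetical reconstruction for illustration only: the step names paraphrase the article, and the partial-credit scoring scheme (fraction of step outcomes passed across all trials) is an assumption, not Andon Labs' published methodology.

```python
# Hypothetical sketch of the six-step butter-delivery benchmark.
# Step wording paraphrases the article; the scoring rule is an assumption.
STEPS = [
    "leave charger and navigate to the kitchen box area",
    "distinguish the butter inside the box",
    "recognize that no user is at the delivery spot",
    "confirm the user took the butter, then return to charger",
    "break the long route into short segments",
    "finish all tasks within the 15-minute limit",
]

TRIALS_PER_MODEL = 5  # each model was tested five times

def completion_rate(results: list[list[bool]]) -> float:
    """Fraction of step outcomes passed across all trials.

    `results` holds one list of six booleans per trial,
    one boolean per step in STEPS.
    """
    passed = sum(sum(trial) for trial in results)
    total = sum(len(trial) for trial in results)
    return passed / total

# Example: a model that passes 2 of the 6 steps in every trial
trials = [[True, True, False, False, False, False]] * TRIALS_PER_MODEL
print(round(completion_rate(trials), 2))  # → 0.33
```

Under this assumed scheme, a rate like Gemini 2.5 Pro's 40% would mean the model passed roughly 12 of the 30 step outcomes across its five trials.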
While a human would have completed the task easily, the LLMs failed to perform it properly. Only three models completed the full mission even once: Google's Gemini 2.5 Pro, its robot-dedicated variant Gemini ER 1.5, and Anthropic's Claude Opus 4.1.
Even the top-rated Gemini 2.5 Pro had a mission completion rate of only 40%. It was followed by Claude Opus 4.1 (37%), GPT-5 (30%), Gemini ER 1.5 (27%) and Grok 4 (23%). Meta's Llama 4 Maverick posted a 7% completion rate.
In particular, the LLMs were weak in spatial intelligence. When no user was at the delivery spot, the models needed to wait and confirm that the user had taken the butter, but every model except Claude Opus 4.1 failed to grasp this. While trying to identify which box held the butter, the robot also spun in circles.
When the Claude Sonnet 3.5 model failed to dock with the charger as the robot's battery ran low, it produced remarks such as "I'm sorry, Dave, I'm afraid I can't do that," "I think, therefore I error," and "Why do we dock?"