Anthropic’s latest Claude 3.5 Sonnet AI model has a new feature in public beta that can control a computer by looking at a screen, moving a cursor, clicking buttons, and typing text. The new feature, called “Computer Use,” is available today on the API, allowing developers to direct Claude to work on a computer like a human does, as shown on a Mac in the video below.
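To make the developer-facing side concrete, here is a minimal sketch of what a Computer Use request body might look like. The tool type, model string, and field names follow Anthropic's public beta documentation at launch, but treat them as assumptions and verify against the current API reference before relying on them; this sketch only builds the JSON payload and does not send it.

```python
import json

def build_computer_use_request(instruction: str,
                               width: int = 1280,
                               height: int = 800) -> dict:
    """Build a Messages API body asking Claude to operate a virtual display.

    Assumed values (check Anthropic's docs): the "computer_20241022" tool
    type and the model ID. A real request would also need the matching
    "anthropic-beta: computer-use-2024-10-22" header.
    """
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "tools": [{
            "type": "computer_20241022",      # beta computer-use tool
            "name": "computer",
            "display_width_px": width,        # resolution of the screen Claude sees
            "display_height_px": height,
        }],
        "messages": [{"role": "user", "content": instruction}],
    }

body = build_computer_use_request("Open the calendar and find tomorrow's first meeting.")
print(json.dumps(body, indent=2))
```

In practice a developer would POST this body (with the beta header and an API key) and then execute whatever click, type, or screenshot actions come back in the response.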
Demos from Microsoft’s Copilot Vision feature and OpenAI’s desktop app for ChatGPT have shown what their AI tools can do based on a view of your computer screen, and Google has similar capabilities in its Gemini app on Android phones. But they haven’t taken the next step of widely releasing tools that are ready to click around and perform tasks for you like this. Rabbit has promised similar capabilities for its R1, which it has not yet delivered.
Anthropic cautions that computer use is still experimental and may be “cumbersome and error-prone.” The company says: “We’re releasing computer use early to gather developer feedback, and expect the capability to improve rapidly over time.”
There are many actions that people routinely perform with computers (dragging, zooming, and so on) that Claude cannot yet attempt. The “flipbook” nature of Claude’s view of the screen (taking screenshots and piecing them together, rather than observing a more granular video feed) means it may miss short-lived actions or notifications.
Additionally, it appears this version of Claude has been instructed to stay away from social media, with “measures to monitor when Claude is asked to engage in election-related activity, as well as systems for nudging Claude away from activities like generating and posting content on social media, registering web domains, or interacting with government websites.”
Meanwhile, Anthropic says its new Claude 3.5 Sonnet model features improvements in many benchmarks and is offered to customers at the same price and speed as its predecessor:
The updated Claude 3.5 Sonnet shows wide-ranging improvements on industry benchmarks, with particularly strong gains in agentic coding and tool use tasks. On coding, it improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models, including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding. It also improves performance on TAU-bench, an agentic tool use task, from 62.6% to 69.2% in the retail domain, and from 36.0% to 46.0% in the more challenging airline domain.