In a note to investors last spring, Anthropic said it intends to develop AI to power virtual assistants that can perform searches, respond to emails and handle other back-office tasks on their own. The company called this a “next generation algorithm for AI self-learning” – one that it says could, if all goes according to plan, one day automate much of the economy.
It’s taken a while, but this AI is starting to arrive.
Anthropic on Tuesday released an improved version of its Claude 3.5 Sonnet model capable of understanding and interacting with any desktop application. Via a new “Computer Use” API, now in open beta, the model can imitate keystrokes, button clicks and mouse gestures, essentially emulating a person sitting at a PC.
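For the curious, here is a minimal sketch of what a request opting into the beta might look like in Python. The endpoint and headers follow Anthropic’s standard Messages API; the beta flag, tool type and model name are the identifiers Anthropic published for the launch, while the screen dimensions and prompt are placeholders of our own.

```python
# Minimal sketch: enable the Computer Use beta tool on a Messages API call.
# Identifiers (beta flag, tool type, model name) are those published for the
# open beta; screen size and prompt are illustrative placeholders.
import os
import requests

response = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
        "anthropic-beta": "computer-use-2024-10-22",  # opt into the open beta
        "content-type": "application/json",
    },
    json={
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "tools": [{
            "type": "computer_20241022",  # the desktop-control tool
            "name": "computer",
            "display_width_px": 1280,     # resolution of the screen Claude sees
            "display_height_px": 800,
        }],
        "messages": [{
            "role": "user",
            "content": "Open a browser and look up tomorrow's weather.",
        }],
    },
)
print(response.json())
```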
“We trained Claude to see what’s happening on a screen and then use available software tools to complete tasks,” Anthropic wrote in a blog post shared with TechCrunch. “When a developer instructs Claude to use computer software and gives it the necessary access, Claude looks at screenshots of what is visible to the user, then counts the number of pixels, vertically or horizontally, that it needs to move a cursor in order to click in the right place.”
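Those screenshots are captured and supplied by the developer’s own harness rather than by the model itself. A rough sketch of that client-side step, using the third-party pyautogui and Pillow libraries (our choice for illustration; Anthropic does not prescribe any particular stack):

```python
# Sketch of the client-side step: capture the screen and base64-encode it so
# it can be returned to the model as an image block. pyautogui/Pillow are one
# possible library choice, not something Anthropic mandates.
import base64
import io

import pyautogui  # pip install pyautogui pillow


def screenshot_as_base64() -> str:
    """Grab the current screen and return it as a base64-encoded PNG."""
    image = pyautogui.screenshot()  # returns a PIL Image
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode("ascii")
```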
Developers can test Computer Use via Anthropic’s API, Amazon Bedrock and Google Cloud’s Vertex AI platform. The new Sonnet 3.5 without Computer Use is rolling out to the Claude apps, and brings various performance improvements over the outgoing Sonnet 3.5 model.
Application automation
A tool that can automate tasks on a PC is not a new idea. Countless companies offer such tools, from decades-old RPA providers to newcomers like Relay, Induced AI and Automat.
In the race to develop so-called “AI agents,” the field has only become more crowded. “AI agent” remains an ill-defined term, but it generally refers to AI that can automate software.
Some analysts claim that AI agents could give businesses an easier path to monetizing the billions of dollars they are pouring into AI. Companies seem to agree: according to a recent Capgemini survey, 10% of organizations already use AI agents and 82% plan to integrate them within three years.
Salesforce made splashy announcements about its AI agent technology this summer, while Microsoft touted new tools for building AI agents yesterday. OpenAI, which is carving out its own brand of AI agents, considers the technology a step toward superintelligent AI.
Anthropic calls its AI agent concept an “action execution layer” that allows the new Sonnet 3.5 to execute commands at the desktop level. With its ability to browse the web (not a first for AI models, but a first for Anthropic), 3.5 Sonnet can use any website and any application.
“Humans maintain control by providing specific prompts that direct Claude’s actions, such as ‘use my computer and online data to fill out this form,’” an Anthropic spokesperson told TechCrunch. “People enable access and limit it as necessary. Claude breaks down the user’s prompts into computer commands (e.g., move cursor, click, type) to accomplish that specific task.”
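In practice, that breakdown surfaces as tool_use blocks in the API response, each naming an action and, for mouse actions, the pixel coordinate Claude computed. A hypothetical dispatcher for executing them (the action names follow the beta documentation; routing them through pyautogui is our own illustrative choice):

```python
# Hypothetical dispatcher: map the model's returned computer actions onto
# real input events. Action names ("screenshot", "left_click", "mouse_move",
# "type", "key") follow the beta docs; executing them via pyautogui is an
# illustrative choice, not Anthropic's prescribed implementation.
import pyautogui


def execute_action(block: dict) -> str | None:
    """Run one tool_use block from the model; return a screenshot if asked."""
    action = block["input"]["action"]
    if action == "screenshot":
        return screenshot_as_base64()        # from the earlier sketch
    if action in ("mouse_move", "left_click"):
        x, y = block["input"]["coordinate"]  # pixel position Claude computed
        pyautogui.moveTo(x, y)
        if action == "left_click":
            pyautogui.click()
    elif action == "type":
        pyautogui.write(block["input"]["text"])
    elif action == "key":
        pyautogui.press(block["input"]["text"])
    return None
```

The harness then returns the result – typically a fresh screenshot – to the model as a tool_result message, and the loop repeats until Claude considers the task done.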
Software development platform Replit used an early version of the new Sonnet 3.5 model to create an “autonomous checker” that can evaluate applications as they are built. Canva, meanwhile, says it’s exploring ways the new model could support the design and editing process.
But how is it different from other AI agents? It’s a reasonable question. Consumer gadget startup Rabbit built a web agent that can do things like buy movie tickets online; Adept, which was recently acquired by Amazon, trains models to browse websites and navigate software; and Twin Labs uses off-the-shelf models, including OpenAI’s GPT-4o, to automate desktop processes.
Anthropic claims that the new Sonnet 3.5 is simply a stronger, more robust model that performs better on coding tasks than even OpenAI’s flagship o1, according to the SWE-bench Verified benchmark. Although it was not explicitly trained to do so, the upgraded Sonnet 3.5 self-corrects and retries tasks when it encounters obstacles, and can work toward goals that require dozens or hundreds of steps.
But don’t fire your secretary just yet.
In an evaluation designed to test an AI agent’s ability to assist with airline reservation tasks, such as changing a flight reservation, the new Sonnet 3.5 was able to successfully complete less than half the tasks. In a separate test involving tasks such as initiating a return, 3.5 Sonnet failed about a third of the time.
Anthropic admits that the upgraded Sonnet 3.5 struggles with basic actions like scrolling and zooming, and that it may miss “short-lived” actions and notifications due to the way it takes screenshots and puts them together.
“Claude’s computer use remains slow and often error-prone,” Anthropic wrote in its post. “We encourage developers to start exploring with low-risk tasks.”
Risky business
But is the new Sonnet 3.5 good enough to be dangerous? Maybe.
A recent study found that models without the ability to use desktop applications, like OpenAI’s GPT-4o, were willing to engage in harmful “multi-step agent behavior,” such as ordering a fake passport from someone on the dark web, when “attacked” using jailbreaking techniques. Jailbreaks led to high success rates in performing harmful tasks, even for models protected by filters and safeguards, according to the researchers.
One can imagine how a model with desktop access could wreak far more havoc – say, by exploiting application vulnerabilities to compromise personal information (or storing chats in plain text). Beyond the software levers at its disposal, the model’s online and application connections could open up avenues for malicious jailbreakers.
Anthropic does not deny that releasing the new Sonnet 3.5 carries risk. But the company argues that the benefits of observing how the model is used in the wild ultimately outweigh that risk.
“We believe it is far better to provide access to computers in current, more limited and relatively safer models,” the company wrote. “This means we can begin to observe and learn from any potential problems that arise at this lower level, gradually and simultaneously expanding computer use and risk mitigation measures.”
Anthropic also says it has taken steps to deter misuse, such as not training the new Sonnet 3.5 on users’ screenshots and prompts, and preventing the model from accessing the web during training. The company says it has developed classifiers to steer 3.5 Sonnet away from actions perceived as high-risk, such as posting on social media, creating accounts, and interacting with government websites.
As the U.S. general election approaches, Anthropic says it is focused on mitigating election-related abuse of its models. The U.S. AI Safety Institute and the U.K. Safety Institute, two separate but allied government agencies dedicated to assessing the risks of AI models, tested the new Sonnet 3.5 before its deployment.
Anthropic told TechCrunch that it has the ability to restrict access to additional websites and features “if necessary,” to protect against spam, fraud and misinformation, for example. As a security measure, the company keeps all screenshots captured by Computer Use for at least 30 days – a retention period that might alarm some developers.
We asked Anthropic under what circumstances, if any, it would release screenshots to a third party (e.g. law enforcement) if requested to do so. A spokesperson said the company would “comply with data requests in response to valid legal process.”
“There is no foolproof method, and we will continually evaluate and iterate our security measures to balance Claude’s capabilities with responsible use,” Anthropic said. “Those who use the computer version of Claude should take necessary precautions to minimize these types of risks, including isolating Claude from particularly sensitive data on their computer.”
Hopefully this will be enough to avoid the worst.
A cheaper model
Today’s headliner might have been the upgraded 3.5 Sonnet model, but Anthropic also said an updated version of Haiku, the cheapest and most efficient model in its Claude series, is on the way.
Claude 3.5 Haiku, expected in the coming weeks, will match the performance of Claude 3 Opus, once Anthropic’s most advanced model, on certain benchmarks, at the same cost and “approximate speed” as Claude 3 Haiku.
“With low latency, improved instruction following, and more accurate tool use, Claude 3.5 Haiku is well-suited for user-facing products, specialized sub-agent tasks, and generating personalized experiences from huge volumes of data, such as purchase history, pricing or inventory data,” Anthropic wrote in a blog post.
3.5 Haiku will initially be available as a text-only model, then as part of a multimodal package capable of analyzing both text and images.
So once version 3.5 Haiku is available, will there be many reasons to use version 3 Opus? What about 3.5 Opus, the successor to 3 Opus, which Anthropic teased last June?
“All models in the Claude 3 model family have their individual uses for customers,” the Anthropic spokesperson said. “Claude 3.5 Opus is on our roadmap and we will be sure to share more as soon as possible.”