Anthropic has unveiled an upgraded Claude 3.5 Sonnet and a brand-new model, Claude 3.5 Haiku, both of which make significant strides in reasoning, coding, and visual processing. The upgraded Claude 3.5 Sonnet improves across the board, leading the industry in coding and performing strongly on multiple industry benchmarks.
Notably, it scored 49.0% on SWE-bench Verified, surpassing all publicly available models, including reasoning models such as OpenAI o1-preview and systems designed specifically for agentic coding. On TAU-bench, which evaluates agentic tool use, it scored 69.2% in the retail domain and 46.0% in the more challenging airline domain.
Most notably, the upgraded Claude 3.5 Sonnet introduces "computer use" in public beta, which lets developers direct Claude to use computers the way people do. Claude can view the screen, move the cursor, click buttons, and type text, opening new possibilities for process automation, software building and testing, and open-ended tasks.
Claude 3.5 Haiku is Anthropic's fastest model, offering performance comparable to Claude 3 Opus at lower cost and higher speed. It is particularly strong at coding, scoring 40.6% on SWE-bench Verified and outperforming many agents built on publicly available state-of-the-art models, including the original Claude 3.5 Sonnet and GPT-4o.
Claude 3.5 Haiku is well suited to user-facing products, specialized sub-agent tasks, and generating personalized experiences from large volumes of data such as purchase histories, pricing, or inventory records.
To make these general computer skills possible, Anthropic has built an API that lets Claude perceive and interact with computer interfaces. Developers can integrate this API so that Claude translates instructions (e.g., "use data from my computer and online to fill out this form") into computer commands (e.g., check a spreadsheet; move the cursor to open a web browser; navigate to the relevant web pages; fill out a form with the data from those pages).
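The instruction-to-command translation described above can be pictured as a loop: the model proposes one action, the host machine executes it, and the result is fed back before the next action is chosen. The following is a minimal sketch under stated assumptions, not Anthropic's API. `FakeComputer`, `scripted_model`, `run_agent`, and the action names are all hypothetical, and a hard-coded script stands in for the model so the example runs offline.

```python
from dataclasses import dataclass, field


@dataclass
class FakeComputer:
    """Hypothetical stand-in for a real desktop; it only records requests."""
    cursor: tuple = (0, 0)
    clicks: list = field(default_factory=list)
    typed: list = field(default_factory=list)

    def execute(self, action: dict) -> str:
        kind = action["action"]  # illustrative action names, not an official schema
        if kind == "screenshot":
            return f"<screen cursor={self.cursor} clicks={len(self.clicks)}>"
        if kind == "mouse_move":
            self.cursor = tuple(action["coordinate"])
            return "moved"
        if kind == "left_click":
            self.clicks.append(self.cursor)
            return "clicked"
        if kind == "type":
            self.typed.append(action["text"])
            return "typed"
        raise ValueError(f"unsupported action: {kind}")


def scripted_model(instruction: str, history: list):
    """Hypothetical model stub: replays a fixed plan, one action per turn."""
    plan = [
        {"action": "screenshot"},
        {"action": "mouse_move", "coordinate": [120, 240]},
        {"action": "left_click"},
        {"action": "type", "text": "Jane Doe"},
    ]
    step = len(history)
    return plan[step] if step < len(plan) else None  # None means "done"


def run_agent(instruction: str, computer: FakeComputer, model, max_steps: int = 10):
    """Ask the model for an action, execute it, feed the result back, repeat."""
    history = []
    for _ in range(max_steps):  # hard cap so a confused model cannot loop forever
        action = model(instruction, history)
        if action is None:
            break
        result = computer.execute(action)
        history.append((action, result))
    return history


if __name__ == "__main__":
    pc = FakeComputer()
    for action, result in run_agent("Fill out this form", pc, scripted_model):
        print(action["action"], "->", result)
```

In a real integration the stub would be replaced by calls to the model, and `execute` would drive an actual screen and input devices. The step cap and the explicit error on unknown actions are design choices of this sketch, not behavior described in the announcement.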
On OSWorld, which evaluates AI models' ability to use computers as people do, Claude 3.5 Sonnet scored 14.9% in the screenshot-only category, well ahead of the next-best AI system's 7.8%. When allowed more steps to complete a task, Claude scored 22.0%.
Anthropic stresses that Claude's current ability to use computers is still imperfect, although the company expects it to improve rapidly in the coming months. Some actions that people perform effortlessly (such as scrolling, dragging, and zooming) remain challenging for Claude, and Anthropic encourages developers to start with low-risk tasks.
Because computer use may provide a new vector for familiar threats such as spam, misinformation, or fraud, Anthropic is taking a proactive approach to safe deployment. It has developed new classifiers that identify when computer use is occurring and whether harm is being done.
The upgraded Claude 3.5 Sonnet is now available to all users. Starting today, developers can build with the "computer use" beta on the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI. The new Claude 3.5 Haiku will be released later this month.