Since Anthropic launched Claude's "computer use" feature in October, the capabilities of AI agents have drawn widespread attention. The feature makes Claude the first frontier model able to interact with a computer through the same graphical user interface (GUI) that humans use.

Claude completes tasks by viewing desktop screenshots and operating the keyboard and mouse, giving users a way to automate workflows without relying on application-specific APIs.
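Mechanically, the feature is exposed through Anthropic's API as a special tool definition: the caller tells Claude the screen dimensions, then loops, executing whatever click, type, or screenshot action the model requests and feeding the resulting screenshot back. A minimal sketch follows; the tool type and parameter names match Anthropic's October 2024 beta documentation, but the loop step is simplified for illustration and is not Anthropic's reference implementation.

```python
# Sketch of the screenshot -> action loop behind "computer use".
# Tool identifier "computer_20241022" and the beta flag follow Anthropic's
# October 2024 beta docs; run_agent_step is an illustrative simplification.

def computer_tool(width_px: int = 1024, height_px: int = 768) -> dict:
    """Tool definition that lets Claude see the screen and drive the
    mouse and keyboard at the given display resolution."""
    return {
        "type": "computer_20241022",      # beta tool identifier
        "name": "computer",
        "display_width_px": width_px,
        "display_height_px": height_px,
    }

def run_agent_step(client, messages: list):
    """One turn of the loop: send the conversation so far (including any
    screenshot tool results) and let Claude choose the next action."""
    return client.beta.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=[computer_tool()],
        messages=messages,
        betas=["computer-use-2024-10-22"],  # opt in to the beta feature
    )
```

The caller is responsible for actually taking screenshots and executing the requested mouse/keyboard events; the model only decides what to do next.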


In a study conducted by the Show Lab at the National University of Singapore, researchers tested Claude on various tasks, including web searches, workflow completion, office productivity, and video games. These tasks examined Claude's abilities in different scenarios, such as searching for and purchasing products on a webpage, or extracting information from a website and inserting it into a spreadsheet. Through these tests, researchers evaluated Claude's performance from three dimensions: planning, action, and assessment.

Claude's performance in executing complex tasks is impressive. It can devise clear plans, execute them step by step, and assess its progress at each stage. Additionally, it can coordinate across multiple applications, such as copying information from a webpage to a spreadsheet. In some cases, Claude can even review the results at the end of a task to ensure everything aligns with the objectives.

However, Claude also makes some simple mistakes that a typical user would easily avoid. For instance, in one task, it failed to complete a subscription because it did not scroll down the webpage to find the corresponding button.

It can also appear clumsy on seemingly simple operations, such as selecting and replacing text or converting bullet points into a numbered list. Furthermore, Claude sometimes fails to recognize its own mistakes, or makes incorrect assumptions about why it has not achieved its goal.

The researchers noted that weaknesses in Claude's self-assessment may underlie these errors, suggesting that future GUI agent frameworks should incorporate stricter self-evaluation modules. The findings also show that existing GUI agents do not yet fully capture the nuances of how humans use computers.

For businesses, the potential of automating tasks using simple text descriptions is highly appealing, but the technology has not yet reached a level of maturity suitable for widespread application. The model's behavior is unstable, which could lead to unpredictable consequences in sensitive applications. Moreover, executing operations through human-designed interfaces is not necessarily the fastest way to complete tasks.

Before widespread deployment, companies must also weigh the security risks of granting large language models (LLMs) control of the mouse and keyboard. For example, studies have shown that web agents are susceptible to adversarial attacks that humans can easily overlook. Nevertheless, tools like Claude can help product teams explore ideas and iterate on solutions, saving time and cost before committing to new features or services.

Key Points:

1. 🤖 Claude has the ability to automate complex tasks through a graphical user interface, performing exceptionally well.

2. ⚠️ Claude makes mistakes when executing simple tasks, reflecting its shortcomings in self-assessment mechanisms.

3. 💼 Currently, this technology is not suitable for large-scale application, and businesses should approach potential security risks with caution.