With the continuous advancement of artificial intelligence (AI) technology, businesses are beginning to weigh whether to rely on a single AI agent or to build a multi-agent network that covers more functions. Recently, LangChain, the company behind the orchestration framework of the same name, conducted experiments aimed at investigating the performance limits of AI agents when faced with excessive instructions and tools.
In a blog post, LangChain detailed its experimental process, focusing on the core question: "Under what circumstances does the performance of a ReAct agent decline when asked to handle too many instructions and tools?" To answer this question, the research team chose the ReAct agent framework, as it is considered "one of the most fundamental agent architectures."
In the experiment, LangChain aimed to evaluate the performance of an internal email assistant on two specific tasks: responding to customer inquiries and scheduling meetings. The researchers used a series of prebuilt ReAct agents built on LangGraph and tested them with several language models, including Anthropic's Claude 3.5 Sonnet, Meta's Llama-3.3-70B, and several OpenAI models such as GPT-4o.
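For reference, the prebuilt ReAct agents described here correspond to LangGraph's `create_react_agent` helper. The sketch below shows how such an agent can be wired up; the model identifier, placeholder tool, and prompt are illustrative assumptions, not LangChain's actual test configuration.

```python
# Minimal sketch: building a prebuilt ReAct agent with LangGraph.
# The model identifier and the placeholder tool are illustrative assumptions,
# not the exact configuration used in LangChain's experiments.
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent


@tool
def lookup_customer(email: str) -> str:
    """Look up a customer record by email address (placeholder implementation)."""
    return f"Customer record for {email}"


model = ChatOpenAI(model="gpt-4o")  # any supported chat model can be swapped in
agent = create_react_agent(model, tools=[lookup_customer])

result = agent.invoke(
    {"messages": [("user", "Who is the customer behind jane@example.com?")]}
)
print(result["messages"][-1].content)
```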
The first step of the experiment was to test the email assistant's customer support capabilities: whether the agent could take in a customer email and produce an appropriate reply. LangChain also paid particular attention to the agent's performance on calendar scheduling, checking that it could accurately remember and follow specific instructions.
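Each of the two task domains implies its own tools. A hypothetical pair of tool definitions, whose names, signatures, and behavior are assumptions for illustration rather than the ones used in the experiment, might look like this:

```python
# Hypothetical tools for the two evaluated domains.
# Names, signatures, and behavior are illustrative assumptions only.
from langchain_core.tools import tool


@tool
def send_email(to: str, subject: str, body: str) -> str:
    """Send a reply email to a customer inquiry."""
    return f"Email sent to {to} with subject '{subject}'"


@tool
def schedule_meeting(attendees: list[str], start_time: str, duration_minutes: int) -> str:
    """Book a meeting on the calendar at the requested time."""
    return f"Meeting booked at {start_time} for {len(attendees)} attendees"
```

Handing both sets of tools, plus the instructions for each domain, to a single ReAct agent is exactly the setup whose limits the stress test probes.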
The researchers set up a stress test with 30 tasks in each of the two domains, customer support and calendar scheduling. The results showed that as agents were given more instructions and tools, their performance degraded sharply, and they often failed to call the necessary tools at all. For example, when instructions spanned up to seven domains, GPT-4o's performance dropped to 2%, while Llama-3.3-70B made frequent mistakes in the task tests, repeatedly failing to invoke the tool for sending emails.
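One simple way such failures can be detected automatically is to scan the agent's returned message history for the expected tool call. The helper below is an illustrative assumption about how a scoring rule might look, not LangChain's actual evaluation harness.

```python
# Illustrative check: did the agent ever call the required tool?
# This is a sketch of one possible scoring rule, not LangChain's harness.
from langchain_core.messages import AIMessage


def called_tool(result: dict, tool_name: str) -> bool:
    """Return True if any assistant turn in the run called `tool_name`."""
    for message in result["messages"]:
        if isinstance(message, AIMessage):
            if any(call["name"] == tool_name for call in message.tool_calls):
                return True
    return False


# Example: a customer-support task counts as failed if the agent
# never invoked the email-sending tool.
# passed = called_tool(agent.invoke({"messages": [("user", task_prompt)]}), "send_email")
```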
LangChain discovered that as the amount of context provided increased, the agents' ability to execute instructions significantly declined. Although Claude 3.5 Sonnet and several other models performed relatively well in multi-domain tasks, their performance gradually decreased as task complexity increased. The company stated that it will further explore how to evaluate multi-agent architectures in order to improve agent performance in the future.