A recent study reveals that even advanced AI language models, such as OpenAI's latest o1-preview, struggle with complex planning tasks.

This research, conducted jointly by scientists from Fudan University, Carnegie Mellon University, ByteDance, and The Ohio State University, evaluated the performance of AI models on two planning benchmarks: BlocksWorld and TravelPlanner.


In the classic planning task of BlocksWorld, most models had accuracy rates below 50%, with only o1-mini (slightly below 60%) and o1-preview (close to 100%) performing relatively well.

However, when researchers turned their attention to the more complex TravelPlanner, the performance of all models was disappointing. GPT-4o achieved a final success rate of only 7.8%, while o1-preview reached 15.6%. Other models, such as GPT-4o-Mini, Llama3.1, and Qwen2, scored between 0 and 2.2%. Although o1-preview showed improvement over GPT-4o, it still fell far short of human planning capabilities.

The researchers identified two main issues. First, the models struggled to integrate rules and conditions, often producing plans that violated preset guidelines. Second, as plans grew longer, the models gradually lost focus on the original problem. To measure how much each input component influenced the planning process, the research team used a "permutation feature importance" method.
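The article does not detail how the team applied this analysis, but the general idea behind permutation feature importance can be sketched as follows: shuffle one input component (for example, the task rules) across examples, re-score the model, and treat the drop in score as that component's importance. The function names and data layout below are illustrative assumptions, not the paper's actual code.

```python
import random

def permutation_importance(score_fn, inputs, component, n_repeats=5, seed=0):
    """Estimate how much one input component matters by shuffling its
    values across examples and measuring the drop in the score.

    score_fn:  callable taking a list of example dicts, returning a score.
    inputs:    list of dicts sharing the same keys (e.g. "rules", "query").
    component: the key whose values are permuted across examples.
    """
    rng = random.Random(seed)
    baseline = score_fn(inputs)
    drops = []
    for _ in range(n_repeats):
        # Shuffle just this component's values across all examples.
        values = [ex[component] for ex in inputs]
        rng.shuffle(values)
        shuffled = [{**ex, component: v} for ex, v in zip(inputs, values)]
        drops.append(baseline - score_fn(shuffled))
    # Average drop over repeats: large drop = important component.
    return sum(drops) / len(drops)

# Toy illustration: the score depends on "signal" but not on "noise".
data = [{"signal": i, "noise": 0, "target": i} for i in range(20)]

def score(examples):
    return sum(ex["signal"] == ex["target"] for ex in examples) / len(examples)

important = permutation_importance(score, data, "signal")
irrelevant = permutation_importance(score, data, "noise")
```

Shuffling the component the score actually depends on produces a large drop, while shuffling an irrelevant one produces none; applied to a planner's inputs, this separates the components the model truly attends to from those it ignores.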

Additionally, the research team tested two common strategies for enhancing AI planning abilities. The first, episodic memory updating, draws knowledge from previous planning attempts; it improved the models' understanding of constraints but did not lead to more careful consideration of individual rules. The second, parametric memory updating, strengthens the influence of task information on planning through fine-tuning, but the core issue of diminishing influence over extended plans remained. Both methods yielded some improvement yet failed to resolve the fundamental problems.

The related code and data will soon be made publicly available on GitHub.

Code repository: https://github.com/hsaest/Agent-Planning-Analysis

Key points:

🌍 The study shows that AI models like OpenAI's o1-preview perform poorly in complex travel planning, with GPT-4o achieving a success rate of only 7.8%.  

📉 Most models perform reasonably well in BlocksWorld but struggle to achieve ideal results in TravelPlanner.  

🧠 The research found that models primarily suffer from inadequate integration of rules and a loss of focus over time.