BaiChuan Intelligence has partnered with Tianjin University to launch the "Sibyl System" agent framework, which ranks first on the GAIA leaderboard. GAIA, proposed by Meta, Hugging Face, and AutoGPT in November 2023, is a benchmark that assesses how well agents carry out complex tasks. It exposes the shortcomings of existing models and points to directions for improving both models and agents.
GAIA's test questions are closer to real-world tasks and require abilities such as reasoning, multi-modal understanding (text, images, audio/video), web browsing, and tool usage. The questions are easy for humans to understand but pose significant challenges for models: GPT-4 equipped with plugins has a success rate of only 15% on the benchmark, while human respondents achieve 92%. Answering a question usually takes a long chain of reasoning and considerable time, involving multiple steps and tools.
The main features of the "Sibyl System" framework are:
A human-like browser interface as an alternative to retrieval-augmented generation.
Question answering instead of dialogue: stateless question-answering functions simplify the system architecture.
Only two general-purpose tools, a web browser and a Python environment, which reduces reliance on specialized tools.
A move from System 1 to System 2 thinking: a "jury" mechanism performs self-criticism and correction through multi-agent debate, and answers become more accurate by drawing on information in the global workspace.
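To make the list above more concrete, here is a minimal Python sketch of how stateless question-answering functions, a global workspace, and a jury might fit together. It is an illustration under stated assumptions, not the Sibyl System implementation: the names GlobalWorkspace, QAFunction, jury_answer, and the three juror stubs are hypothetical, the jurors are hard-coded stand-ins for LLM calls, and the multi-agent debate is reduced to a simple majority vote.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical type: a stateless QA function maps (question, context) to an
# answer and keeps no conversation history between calls.
QAFunction = Callable[[str, str], str]


@dataclass
class GlobalWorkspace:
    """Shared, append-only notes that every module can read (assumed design)."""
    notes: List[str] = field(default_factory=list)

    def add(self, note: str) -> None:
        self.notes.append(note)

    def as_context(self) -> str:
        return "\n".join(self.notes)


def jury_answer(question: str, workspace: GlobalWorkspace,
                jurors: List[QAFunction]) -> str:
    """Ask each juror independently and return the most common answer.

    Majority voting is a simplification that stands in for debate.
    """
    context = workspace.as_context()
    votes = [juror(question, context) for juror in jurors]
    workspace.add(f"jury votes for {question!r}: {votes}")
    winner, _count = Counter(votes).most_common(1)[0]
    return winner


# Stub jurors standing in for LLM-backed agents (hypothetical behaviour).
def juror_a(question: str, context: str) -> str:
    return "42" if "answer is 42" in context else "unknown"


def juror_b(question: str, context: str) -> str:
    return "42"


def juror_c(question: str, context: str) -> str:
    return "41"


workspace = GlobalWorkspace()
workspace.add("browser: the page states the answer is 42")
print(jury_answer("What is the answer?", workspace, [juror_a, juror_b, juror_c]))
```

Because each juror receives everything it needs through its arguments, any juror can be replaced by a function backed by a different model without touching the rest of the sketch.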
Sibyl System is a structurally simple yet powerful agent framework based on large language models that can solve complex reasoning problems with only a few tools. By introducing the global workspace and multi-agent mechanisms, together with a browser-based general channel for acquiring information, it reduces system complexity while increasing the complexity of the problems it can solve, shifting models from "fast thinking" to "slow thinking." Sibyl System is also highly scalable and easy to debug, making it easy to swap in agent modules from other models to enhance capability.
Technical Report: https://arxiv.org/pdf/2407.10718