Recently, an unusual AI capability test run inside "Minecraft" attracted significant attention. The new and old versions of Claude 3.5 Sonnet competed in building challenges and showed clear differences in ability, with the new version (informally called "Sonnet 3.6") performing particularly well.

This test, initiated by developer adi, has been humorously dubbed the "only reliable benchmark." Benchmark researcher Aidan McLau believes this method perfectly meets the current needs for AI assessment and points out that aesthetic ability is closely related to intellectual level. The project quickly gained support from the open-source community, and the related code has been made available on GitHub.

The test results showed that various models exhibited unique "personalities":

Sonnet 3.6 came out slightly ahead in creativity, winning the support of more than 2,000 users.

OpenAI's o1-preview, while slower to build, excelled at recreating real buildings (such as the Taj Mahal).

o1-mini, however, was unable to complete related tasks.

Llama 3 405B built a "diamond wall over a fire pit" as a symbol of itself.

Alibaba's Qwen 2.5-14B also demonstrated impressive capabilities.

It is worth noting that the AI does not build through visual understanding or direct control of input devices. Instead, it receives the game context as text and generates operational commands, much like playing blindfold chess. The technical implementation mainly relies on:

mineflayer open-source library: converts AI-generated commands into executable API calls.

mindcraft open-source library: provides general prompts and examples, supporting various models to access the game.
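The text-only loop described above can be illustrated with a minimal sketch. This is not the mindcraft or mineflayer code: the `!place(...)` command syntax, the `World` class, and the function names are all hypothetical, chosen only to show how a model's text reply could be turned into actions without any vision.

```python
import re
from dataclasses import dataclass, field

# Hypothetical command format for illustration, e.g. "!place(stone, 1, 64, 2)".
# The real libraries define their own command syntax.
COMMAND_RE = re.compile(
    r"!place\(\s*(\w+)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*\)"
)


@dataclass
class World:
    """Toy stand-in for the game state; a real setup would call a bot API."""

    blocks: dict = field(default_factory=dict)

    def describe(self) -> str:
        """Render the world as text, the only 'view' the model receives."""
        if not self.blocks:
            return "The world is empty."
        lines = [f"{name} at {pos}" for pos, name in sorted(self.blocks.items())]
        return "Blocks placed so far:\n" + "\n".join(lines)

    def place(self, name: str, x: int, y: int, z: int) -> None:
        self.blocks[(x, y, z)] = name


def apply_model_reply(world: World, reply: str) -> int:
    """Extract every place command from the model's text and execute it.

    Returns the number of commands that were applied.
    """
    count = 0
    for name, x, y, z in COMMAND_RE.findall(reply):
        world.place(name, int(x), int(y), int(z))
        count += 1
    return count


if __name__ == "__main__":
    world = World()
    # In a real run this string would come from the language model.
    reply = "I'll start the wall. !place(stone, 0, 64, 0) !place(stone, 1, 64, 0)"
    apply_model_reply(world, reply)
    print(world.describe())
```

On each turn, the text from `describe()` would be sent back to the model as context, so the model "sees" the build only through written descriptions.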

The project team plans to further refine this assessment mechanism by creating a scoring system similar to the LMSYS Chatbot Arena, ranking models with the Elo algorithm based on votes from human users. It is reported that the complete testing environment can be set up in just 15 minutes.
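As a rough illustration of how Elo ranking from pairwise votes works, here is a minimal sketch of the standard Elo update. How the project will actually implement its scoring is not stated; the K-factor of 32, the starting rating of 1000, and the model names are assumptions made for this example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A won the vote, 0.0 if B won, 0.5 for a tie.
    k=32 is an assumed K-factor, not a value taken from the project.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def rank_from_votes(votes, start: float = 1000.0, k: float = 32.0):
    """Apply a sequence of (winner, loser) votes and return the ratings dict."""
    ratings = {}
    for winner, loser in votes:
        r_w = ratings.get(winner, start)
        r_l = ratings.get(loser, start)
        ratings[winner], ratings[loser] = update_elo(r_w, r_l, 1.0, k)
    return ratings


if __name__ == "__main__":
    # Hypothetical votes: each tuple is (preferred build, other build).
    votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
    ratings = rank_from_votes(votes)
    for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

With two models that both start at 1000 and K=32, a single vote moves the winner to 1016 and the loser to 984. An upset win over a higher-rated model moves ratings more than an expected win does.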

This novel assessment method not only showcases the creativity of AI but also offers a fresh perspective on evaluating the capabilities of large models. Just as o1-preview chose to build a robot and spell out "GPT" during its free play, AI seems to have begun to express its "personality" in this virtual world. As more models join the testing, this classic game is becoming a distinctive venue for observing the development of AI.

Video tutorial:

https://x.com/mckaywrigley/status/1849613686098506064

Open-source code:

https://github.com/kolbytn/mindcraft

https://github.com/mc-bench/orchestrator