AI Showdown in Minecraft! Claude's New Version Amazes the Internet with Its Building Skills

AIbase基地

Published in AI News · 4 min read · Nov 15, 2024

Recently, an unusual test of AI capability run inside "Minecraft" attracted significant attention. The old and new versions of Claude 3.5 Sonnet competed in building challenges and showed clear differences in ability, with the new version (unofficially nicknamed "Sonnet 3.6") performing particularly well.

The test, initiated by developer adi, has been humorously dubbed the "only reliable benchmark." Benchmark researcher Aidan McLau argues that the method fits what AI assessment currently needs, and points out that aesthetic ability is closely related to intelligence. The project quickly gained support from the open-source community, and the related code is available on GitHub.

The results showed that each model exhibited its own "personality":

Sonnet 3.6 narrowly came out ahead on creativity, drawing support from more than 2,000 users.

OpenAI's o1-preview was slower to build but excelled at recreating real buildings (such as the Taj Mahal).

o1-mini, by contrast, was unable to complete the tasks.

Llama 3 405B built a "diamond wall over a fire pit" as a symbol of itself.

Alibaba's Qwen 2.5-14B also demonstrated impressive capabilities.

It is worth noting that the AI does not build by visually understanding the game or by directly controlling input devices. Instead, it receives the game context as text and generates action commands in response, similar to playing blindfold chess. The technical implementation mainly relies on:

mineflayer open-source library: converts AI-generated commands into executable API calls.

mindcraft open-source library: provides general prompts and examples, letting various models connect to the game.
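The text-only loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the `place <block> <x> <y> <z>` command syntax, the `PlaceBlock` type, and both helper functions are invented for this example, and a hard-coded string stands in for a real model reply. In the actual setup, mineflayer would turn parsed actions into game API calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaceBlock:
    """One parsed build action (hypothetical structure for this sketch)."""
    block: str
    x: int
    y: int
    z: int


def build_context(position, nearby_blocks):
    """Describe the game state as plain text; the model never sees pixels."""
    x, y, z = position
    nearby = ", ".join(sorted(nearby_blocks)) or "nothing"
    return f"You are at ({x}, {y}, {z}). Nearby blocks: {nearby}."


def parse_commands(model_reply):
    """Turn lines like 'place stone 0 64 0' into PlaceBlock actions.

    The 'place <block> <x> <y> <z>' syntax is invented for this example.
    Lines that do not match it are skipped rather than raising.
    """
    actions = []
    for line in model_reply.splitlines():
        parts = line.split()
        if len(parts) != 5 or parts[0] != "place":
            continue
        try:
            x, y, z = (int(p) for p in parts[2:])
        except ValueError:
            continue
        actions.append(PlaceBlock(parts[1], x, y, z))
    return actions


# A hard-coded reply stands in for a real model call.
context = build_context((0, 64, 0), {"grass_block", "oak_log"})
reply = "place stone 0 64 1\nI will add a second block.\nplace stone 0 65 1"
actions = parse_commands(reply)
```

Here `actions` holds two `PlaceBlock` entries, and the free-text line in the middle of the reply is ignored. Skipping unparseable lines is one possible design choice; a stricter harness might instead report them back to the model.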

The project team plans to refine this assessment mechanism further by creating a scoring system similar to the LMSYS arena, which uses the Elo algorithm to rank models based on votes from human users. Reportedly, the complete testing environment can be set up in just 15 minutes.
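For readers unfamiliar with Elo, the ranking idea can be sketched with the standard Elo update applied to pairwise votes. This is a minimal sketch, not the project's scoring code: the starting rating of 1000, the K-factor of 32, the model names, and the votes are all assumptions chosen for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0):
    """Apply one human vote (winner preferred over loser) in place."""
    expected_win = expected_score(ratings[winner], ratings[loser])
    # The winner gains exactly what the loser gives up, so the total is conserved.
    delta = k * (1.0 - expected_win)
    ratings[winner] += delta
    ratings[loser] -= delta
    return ratings


# Hypothetical votes: each tuple is (preferred build, other build).
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
for winner, loser in votes:
    update_elo(ratings, winner, loser)
```

Because an upset win against a higher-rated model moves ratings more than an expected win, the ranking adapts as votes accumulate without requiring every pair of models to be compared equally often.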

This novel assessment method not only showcases the creativity of AI but also offers a fresh perspective on the objective evaluation of large model capabilities. When o1-preview was left to play freely, for example, it chose to build a robot and spell out "GPT", which suggests that AI has begun to express its "personality" in this virtual world. As more models join the testing, this classic game is becoming a unique platform for observing the development of AI.

Video tutorial:

https://x.com/mckaywrigley/status/1849613686098506064

Open-source code:

https://github.com/kolbytn/mindcraft

https://github.com/mc-bench/orchestrator

This article is from AIbase Daily

—— Created by the AIbase Daily Team
