Recently, large language models (LLMs) with ultra-long context windows have become a hot topic of discussion. These models can process hundreds of thousands to millions of tokens in a single prompt, opening up many new possibilities for developers. However, how well can these long-context LLMs understand and utilize the vast amount of information they receive?

To answer this question, researchers at Google DeepMind have introduced a new benchmark called Michelangelo, designed to evaluate the long-context reasoning capabilities of these models.

The results indicate that while current top-tier models have made progress at extracting information from long contexts, they still struggle with tasks that require reasoning over that information and understanding its structure.

As LLMs with long context windows emerge, researchers are beginning to realize the need for new benchmarks to assess these models' capabilities. Existing evaluations often focus on information retrieval tasks, such as "finding a needle in a haystack," which involves searching for specific information within a large context. However, simple retrieval does not equate to an understanding of the overall context.

To tackle these challenges, Michelangelo takes a different approach, setting complex tasks that require models to reason over and synthesize long texts rather than merely retrieve from them. The evaluation framework includes tasks spanning both code and natural language, which test not only a model's ability to recall information but also its depth of understanding and how it processes that information.

Michelangelo requires models to solve three core long-document synthesis tasks: "Latent List," "Multi-Round Coreference Resolution" (MRCR), and "I Don't Know" (IDK). These tasks not only assess how a model performs over long documents but also expose its shortcomings in reasoning and synthesis.

The first task is "Latent List," where the model needs to process a long series of Python list operations, filtering out irrelevant or redundant statements to determine the list's final state.
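
As a toy illustration of what such an item might involve (this is a minimal sketch, not an example from the paper's test set), the sequence below mixes operations that change the list with statements that do not, and the ground-truth final state can be computed by simply executing the operations:

```python
# Illustrative Latent List-style item (hypothetical, not from the benchmark):
# a sequence of Python list operations, some of which are irrelevant,
# and the model must report the list's final state.
ops = [
    "lst = []",
    "lst.append(3)",
    "print(len(lst))",    # irrelevant: does not change the list
    "lst.append(7)",
    "lst.pop()",
    "lst.extend([1, 2])",
    "sorted(lst)",        # irrelevant: the sorted copy is discarded
    "lst.remove(1)",
]

# Compute the ground-truth final state by executing the operations.
namespace = {}
for op in ops:
    exec(op, namespace)

print(namespace["lst"])  # expected final state: [3, 2]
```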

The second task is "Multi-Round Coreference Resolution" (MRCR), where the model must track the structure of a long multi-turn conversation and resolve references back to specific earlier turns.
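
As a rough sketch of the setup (the topics, wording, and turn structure below are assumptions for illustration, not the paper's actual materials), an MRCR-style prompt can be thought of as a long conversation padded with near-identical exchanges, ending with a question that points back to one specific earlier turn:

```python
# Hypothetical MRCR-style prompt: many look-alike request/response pairs,
# followed by a query that the model can only answer by resolving which
# earlier turn "the second poem about the ocean" refers to.
topics = ["the ocean", "a mountain", "autumn", "a city at night"]

turns = []
for i in range(200):  # pad the context with near-identical exchanges
    topic = topics[i % len(topics)]
    turns.append(f"User: Please write a short poem about {topic}.")
    turns.append(f"Assistant: [poem #{i} about {topic}]")

query = "User: Reproduce the second poem you wrote about the ocean."
prompt = "\n".join(turns + [query])

print(prompt[:300])  # preview of the synthetic long-conversation prompt
```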

The third task is "I Don't Know" (IDK), where the model answers multiple-choice questions and must determine whether the context actually contains the answer, responding with "I Don't Know" when it does not.
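
A toy example of the format (again an assumption for illustration rather than the paper's exact layout) might look like the following, where the correct behavior is to pick the "I don't know" option because the context never answers the question:

```python
# Hypothetical IDK-style item: the question is unanswerable from the context,
# so the expected response is the "I don't know" option rather than a guess.
context = (
    "Mara adopted a grey cat named Juniper and moved to a small coastal town, "
    "where she opened a bookshop."
)
question = "What is the name of Mara's dog?"
options = ["A) Juniper", "B) Clover", "C) Basil", "D) I don't know"]
expected = "D"  # the context never mentions a dog

prompt = (
    f"Context:\n{context}\n\n"
    f"Question: {question}\n" + "\n".join(options) + "\nAnswer with one letter."
)
print(prompt)
```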

The researchers evaluated ten top-tier LLMs (including different versions of Gemini, GPT-4, and Claude) on Michelangelo, testing them on contexts of up to 1 million tokens. The Gemini models performed best on MRCR, the GPT models excelled on Latent List, and Claude 3.5 Sonnet scored highest on IDK.

The researchers found that although the models varied in how well they handled long contexts, all of them declined significantly when faced with the more complex reasoning tasks.

This indicates that even with ultra-long context windows, current LLMs still need to improve their reasoning capabilities.

The researchers plan to keep expanding Michelangelo's evaluation tasks and hope to make the benchmark directly available for other researchers to test their models.

Paper link: https://arxiv.org/abs/2409.12640

Key points:

🔍 Michelangelo is a new benchmark designed to evaluate the long-context reasoning capabilities of LLMs.

🧩 The study shows a significant performance drop in existing models on complex reasoning tasks.

📈 The researchers plan to expand the evaluation tasks to further advance research on model reasoning capabilities.