A recent study jointly released by the University of Munich, the Munich Center for Machine Learning, and Adobe Research shows that 12 leading AI language models, including GPT-4o, Gemini 1.5 Pro, and Llama-3.3-70B, suffer significant performance degradation on long-text conceptual reasoning tasks. Although these models support contexts of at least 128,000 tokens, their ability to form deep logical connections remains fundamentally limited.

The research team developed the NOLIMA (No Literal Matching) benchmark, which deliberately avoids keyword repetition between the question and the text in order to expose AI models' weaknesses in linking concepts. For example, when the text states that "Yuki lives next to the Semperoper," the model must draw on the common knowledge that the Semperoper is located in Dresden in order to answer the question "Which character has been to Dresden?"
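The core design constraint can be illustrated with a short sketch. The needle sentence, the question, and the stop-word list below are illustrative assumptions, not taken from the actual benchmark data; the point is that after discarding function words, the question shares no vocabulary with the needle, so the answer can only be found via the latent knowledge hop.

```python
# Hypothetical sketch of a NOLIMA-style test item: the question and the
# planted "needle" sentence must share no content words, so literal
# matching alone cannot locate the answer.

def content_words(text: str) -> set[str]:
    """Lowercase content words, with common function words removed."""
    stop = {"the", "a", "an", "to", "of", "in", "who", "has", "been", "which"}
    words = {w.strip(".,?!'\"").lower() for w in text.split()}
    return {w for w in words if w and w not in stop}

# The needle planted somewhere in a long distractor text:
needle = "Yuki lives next to the Semperoper."
# The question never repeats the needle's key terms:
question = "Which character has been to Dresden?"
# Answering requires a one-hop latent inference, not a lexical match:
world_knowledge = {"Semperoper": "Dresden"}

# The benchmark's defining property: zero lexical overlap.
overlap = content_words(needle) & content_words(question)
print(overlap)  # → set() (no shared content words)
```

Because the overlap is empty, a model that merely scans for repeated words has nothing to latch onto; it must bridge "Semperoper" and "Dresden" through knowledge it already holds.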


Test results show:  

1. **Dramatic decline in long-text performance**: When the context grows from 2,000 to 8,000 tokens, most models show significant performance drops; at 32,000 tokens, 10 of the 12 models fall to half of their short-text performance.

2. **The attention mechanism is the weak point**: Models struggle to locate relevant information in long texts, and accuracy drops further when the key answer appears in the latter half of the text.

3. **Dedicated reasoning models also fall short**: o1, o3-mini, and DeepSeek-R1, models designed for complex reasoning, scored below 50% on the 32K-token NOLIMA-Hard test, despite performing nearly perfectly on short texts.

The research identifies the core issue as the models' over-reliance on surface-level word matching. When the tests deliberately exclude shared vocabulary, even Chain-of-Thought (CoT) prompting yields only limited improvement in Llama-3.3-70B's long-text performance. More critically, word matches appearing in irrelevant context actively mislead the model and worsen its misjudgments.
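The failure mode described above can be sketched with a toy keyword scorer. This is illustrative code, not the study's methodology: the haystack sentences and scoring function are invented to show why literal matching succeeds on classic needle-in-a-haystack phrasing but is pulled toward a lexical distractor on a NOLIMA-style question.

```python
# Illustrative sketch: a naive word-overlap scorer of the kind that
# long-context models implicitly exploit, applied to two question styles.

def keyword_score(question: str, sentence: str) -> int:
    """Count shared content words between a question and a sentence."""
    stop = {"the", "a", "an", "to", "of", "who", "has", "been", "which"}
    q = {w.strip(".,?!").lower() for w in question.split()} - stop
    s = {w.strip(".,?!").lower() for w in sentence.split()} - stop
    return len(q & s)

haystack = [
    "The weather in Kyoto was mild that spring.",
    "Yuki lives next to the Semperoper.",            # the relevant needle
    "The museum shop sold Dresden porcelain figurines.",  # lexical distractor
]

# Classic needle-in-a-haystack phrasing repeats the needle's words:
literal_q = "Who lives next to the Semperoper?"
# NOLIMA-style phrasing shares none; the link runs through world knowledge:
nolima_q = "Which character has been to Dresden?"

best_literal = max(haystack, key=lambda s: keyword_score(literal_q, s))
best_nolima = max(haystack, key=lambda s: keyword_score(nolima_q, s))

print(best_literal)  # the needle wins: three shared content words
print(best_nolima)   # the Dresden distractor wins: matching fails
```

The second lookup lands on the distractor precisely because "Dresden" appears there literally, mirroring the study's finding that word-matching interference in irrelevant context exacerbates misjudgments.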

"This reveals a fundamental contradiction in current AI—it's easy to expand the context window, but difficult to enhance deep reasoning capabilities," the researchers emphasize. Taking GPT-4o as an example, although it achieves an effective context length of 8,000 tokens, it still struggles with cross-paragraph concept integration. As the text lengthens, the model's attention mechanism gradually "loses focus," making it difficult to maintain a coherent logical chain.  

This research is a wake-up call for AI development: merely increasing processing length cannot break through the reasoning bottleneck. The industry needs to rethink model architecture design and develop more efficient mechanisms for extracting and linking information. Going forward, enabling AI to truly understand text rather than rely on pattern matching will be the key to pushing past the limits of long-text processing.