A new independent evaluation shows that Meta's latest Llama 4 models, Maverick and Scout, perform well on standard benchmarks but falter on complex, long-context tasks. In the AI analytics firm's "Intelligence Index," Maverick scored 49 points, surpassing Claude 3.7 Sonnet (score not specified) but trailing DeepSeek V3 0324 (53 points). Scout scored 36 points, roughly on par with GPT-4o mini and ahead of Claude 3.5 Sonnet and Mistral Small 3.1. Both models delivered consistent results across reasoning, coding, and mathematical tasks, with no significant weaknesses in those areas.

Maverick's architectural efficiency is striking. It uses only 17 billion active parameters versus DeepSeek V3's 37 billion, and its 402 billion total parameters amount to roughly 60% of DeepSeek V3's 671 billion; it can also process images, not just text. On pricing, Maverick costs $0.24/$0.77 per million input/output tokens and Scout $0.15/$0.40, undercutting DeepSeek V3 and coming in about ten times cheaper than GPT-4o, which makes them among the most affordable AI models available.
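
To put those per-token prices in concrete terms, here is a minimal Python sketch estimating total spend for a hypothetical workload. Only the Llama 4 prices come from the figures above; the GPT-4o rates are an assumed reference point for the "ten times cheaper" comparison, and the workload size is invented for illustration.

```python
# Prices in USD per million input/output tokens.
# Llama 4 figures are as quoted above; GPT-4o figures are an assumption for comparison.
PRICES = {
    "llama-4-maverick": (0.24, 0.77),
    "llama-4-scout":    (0.15, 0.40),
    "gpt-4o (assumed)": (2.50, 10.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a single request, billed per million tokens."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Hypothetical workload: one million requests, each with 2,000 input and 500 output tokens.
for model in PRICES:
    total = 1_000_000 * request_cost(model, 2_000, 500)
    print(f"{model}: ${total:,.0f}")
```

Under these assumptions, Maverick works out to roughly $865 for the whole workload versus about $10,000 at the assumed GPT-4o rates, which is where the order-of-magnitude price gap comes from.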

However, the Llama 4 launch has sparked controversy. On the LMArena benchmark, Maverick ranks second with Meta's recommended "experimental chat version," but drops to fifth place once "style control" is enabled, suggesting its lead owes more to formatting optimization than to raw content quality. Testers questioned the reliability of Meta's benchmark claims, noting significant discrepancies with the model's performance on other platforms. Meta acknowledged optimizing for the human evaluation experience but denied any data manipulation.

Long-context tasks are a clear weakness for Llama 4. Fiction.live tests showed Maverick achieving only 28.1% accuracy at 128,000 tokens, with Scout even lower at 15.6%, far behind Gemini 2.5 Pro's 90.6%. Although Meta claims a 1-million-token context window for Maverick and a 10-million-token window for Scout, real-world performance falls far short of those figures. Research also suggests diminishing returns from ultra-large context windows, with windows under 128K remaining the more practical choice.

Meta's Head of Generative AI, Ahmad Al-Dahle, responded that early inconsistencies stemmed from implementation issues rather than flaws in the models themselves. He denied allegations of benchmark manipulation and said deployment optimizations are underway, with stability expected within days.