Meta released its new flagship AI model, Maverick, on Saturday, and it quickly claimed second place on LM Arena, a benchmark in which human raters compare outputs from different models and choose the one they prefer. However, several AI researchers soon noticed a significant discrepancy between the version of Maverick Meta deployed to LM Arena and the version made widely available to developers.
Meta acknowledged in its announcement that the Maverick on LM Arena was an "experimental chat version." Meanwhile, a chart on the official Llama website indicates that Meta's LM Arena test used "Llama 4 Maverick optimized for conversationality." This discrepancy sparked questions within the research community.
AI researchers on the social media platform X noted a clear behavioral difference between the publicly downloadable Maverick and the version hosted on LM Arena: the LM Arena version leaned heavily on emojis and produced long-winded responses, behavior rarely seen in the standard release. Researcher Nathan Lambert shared this observation on X, sarcastically commenting, "Okay, Llama 4 is definitely a little cooked, haha. What yap city is this?", along with screenshots.
Tailoring a model to a specific benchmark while releasing a different, supposedly "vanilla" version to the public raises serious concerns. Chiefly, it makes it difficult for developers to predict how the model will actually perform in real-world applications. It is also misleading, since the purpose of a benchmark is to provide a snapshot of a single model's strengths and weaknesses across a range of tasks.
For various reasons, LM Arena has never been the most reliable measure of an AI model's performance; even so, AI companies generally have not admitted to tuning their models specifically to score better on benchmarks. Meta's approach appears to break that convention, prompting a broader discussion about transparency in AI model evaluation.