Recently, Meta's open-source large language model, Llama-4-Maverick, plummeted from second to 32nd place on the LMArena leaderboard, sparking widespread skepticism among developers who suspect Meta of submitting a specialized version to manipulate the rankings.

The controversy began on April 6th, when Meta released its latest large language model, Llama 4, encompassing three versions: Scout, Maverick, and Behemoth. Initially, Llama-4-Maverick performed impressively, securing second place on the LMArena leaderboard, trailing only Gemini 2.5 Pro.

However, as user feedback on the publicly available Llama 4 surfaced, the model's reputation quickly deteriorated. Developers discovered significant discrepancies between the version Meta submitted to LMArena and the openly released version, fueling allegations of ranking manipulation.

According to Chatbot Arena, Meta's initial submission, Llama-4-Maverick-03-26-Experimental, was an experimentally optimized version, and it was this variant that ranked second. The standard open-source release, Llama-4-Maverick-17B-128E-Instruct, despite boasting 17 billion active parameters and 128 MoE experts, only achieved a 32nd-place ranking, significantly lagging behind top performers like Gemini 2.5 Pro and GPT-4o, and even underperforming Llama-3.3-Nemotron-Super-49B-v1, a model built on the previous-generation Llama 3.3.

Addressing the gap between the two versions, Meta explained at a recent conference that Llama-4-Maverick-03-26-Experimental was "specifically optimized for dialogue," which accounts for its relatively high score on LMArena. That optimization, while yielding a strong leaderboard position, makes the ranking a poor predictor of the model's performance across other scenarios.

A Meta spokesperson told TechCrunch that Meta will continue exploring customized versions and expects developers to adapt and improve Llama 4 based on their needs. The company welcomes developers' creativity and values their feedback.