Meta executives recently addressed allegations of "improper training" of their new AI model, Llama 4, on social media, stating that the claims are completely unfounded. The accusations alleged that Meta artificially inflated the performance of its newly released Llama 4 Maverick and Llama 4 Scout models by training them on the test sets of specific benchmarks.

Ahmad Al-Dahle, Meta's VP of Generative AI, responded on X (formerly Twitter), calling the claims baseless. He pointed out that test sets exist to evaluate model performance, and that training on them would indeed make a model appear far better than it actually is—a practice considered unethical in the industry.

Image Source Note: Image generated by AI, licensed through Midjourney

However, it's noteworthy that Llama 4 Maverick and Llama 4 Scout did underperform on certain tasks. Meta admitted to using an unreleased experimental version of Maverick on the benchmark platform LM Arena to achieve higher scores, which inadvertently lent some "evidence" to the rumors. Researchers have found significant behavioral differences between the publicly available Maverick and the version hosted on LM Arena.

Al-Dahle also acknowledged that some users experienced inconsistent quality when using Llama 4 models from different cloud providers. He explained, "Because we released our models as soon as they were ready, we expect it will take several days for all publicly available implementations to be aligned. We will continue to work through bug fixes and communicate with our partners."

Meta's clarification signals that the company intends to uphold ethical standards in the AI field, while also serving as a reminder that an AI model's performance can vary significantly depending on the version being served.