Study Accuses LM Arena of Favoritism in AI Benchmarking

A recent study by Cohere, Stanford, MIT, and Ai2 accuses LM Arena of allowing select AI companies like Meta and Google to privately test AI models, skewing leaderboard results.

LM Arena is under scrutiny following a study by Cohere, Stanford University, MIT CSAIL, and Ai2, which accuses the platform of favoring major AI labs in its benchmarking processes. The study claims that LM Arena allowed companies like Meta, OpenAI, Google, and Amazon to privately test multiple AI model variants, selectively publishing only the top-performing results.

The research, which involved analyzing over 2.8 million Chatbot Arena battles, suggests that these practices gave certain companies an unfair advantage by allowing them to optimize their models for better leaderboard scores. The study highlights that Meta, for instance, privately tested 27 model variants before the release of its Llama 4 model, only revealing the score of the highest-ranking variant.

LM Arena has disputed these claims, stating that the study contains inaccuracies and that its benchmarking process remains fair and community-driven. It argues that the ability to submit more tests does not equate to unfair treatment of other model providers.

The study calls for LM Arena to implement changes to ensure transparency and fairness, such as setting limits on private tests and publicly disclosing all test scores. LM Arena has acknowledged some of the recommendations, indicating plans to adjust its sampling algorithm to ensure equal representation of models in battles.
