Study Accuses LM Arena of Favoritism in AI Benchmarking

A recent study by Cohere, Stanford, MIT, and Ai2 accuses LM Arena of allowing select AI companies like Meta and Google to privately test AI models, skewing leaderboard results.

LM Arena is under scrutiny following a study by Cohere, Stanford University, MIT CSAIL, and Ai2, which accuses the platform of favoring major AI labs in its benchmarking processes. The study claims that LM Arena allowed companies like Meta, OpenAI, Google, and Amazon to privately test multiple AI model variants, selectively publishing only the top-performing results.

The research, which involved analyzing over 2.8 million Chatbot Arena battles, suggests that these practices gave certain companies an unfair advantage by allowing them to optimize their models for better leaderboard scores. The study highlights that Meta, for instance, privately tested 27 model variants before the release of its Llama 4 model, only revealing the score of the highest-ranking variant.

LM Arena has disputed these claims, stating that the study contains inaccuracies and that its benchmarking process remains fair and community-driven. It argues that the ability to submit more tests does not equate to unfair treatment of other model providers.

The study calls for LM Arena to implement changes to ensure transparency and fairness, such as setting limits on private tests and publicly disclosing all test scores. LM Arena has acknowledged some of the recommendations, indicating plans to adjust its sampling algorithm to ensure equal representation of models in battles.
