Researchers Say the Most Popular Tool for Grading AIs Unfairly Favors Meta, Google, OpenAI
The most popular method for measuring which chatbots are the best in the world is flawed and frequently manipulated by powerful companies like OpenAI and Google in order to make their products seem better than they actually are, according to a new paper from researchers at the AI company Cohere, as well as Stanford, MIT, and other universities.

The researchers came to this conclusion after reviewing data that's made public by Chatbot Arena (also known as LMArena and LMSYS), which facilitates benchmarking and maintains the leaderboard listing the best large language models, as well as by scraping Chatbot Arena and running their own tests. Chatbot Arena, meanwhile, has responded to the researchers' findings by saying that while it accepts some criticisms and plans to address them, some of the numbers the researchers presented are wrong and mischaracterize how Chatbot Arena actually ranks LLMs. The research was published just weeks after Meta was accused of gaming AI benchmarks with one of its recent models.

If you're wondering why this beef between the researchers, Chatbot Arena, and others in the AI industry matters at all, consider the fact that the biggest tech companies in the world, as well as a great number of lesser-known startups, are currently in a fierce competition to develop the most advanced AI tools, operating under the belief that these AI tools will define the future of humanity and enrich the most successful companies in this industry in a way that will make previous technology booms seem minor by comparison.

I should note here that Cohere is an AI company that produces its own models, and that they don't appear to rank very highly on the Chatbot Arena leaderboard. The researchers also make the point that proprietary, closed models from competing companies appear to have an unfair advantage over open-source models, and Cohere proudly boasts that its model Aya is one of the largest open science efforts in ML to date. In other words, the research is coming from a company that Chatbot Arena doesn't benefit.

Judging which large language model is the best is tricky because different people use different AI models for different purposes, and what counts as the best result is often subjective, but the desire to compete and to compare these models has made the AI industry default to the practice of benchmarking them. The most prominent venue for that is Chatbot Arena, which gives a numerical Arena Score to the models companies submit and maintains a leaderboard listing the highest-scoring ones. At the moment, for example, Google's Gemini 2.5 Pro is in the number one spot, followed by OpenAI's o3, GPT-4o, and xAI's Grok 3.

The vast majority of people who use these tools probably have no idea the Chatbot Arena leaderboard exists, but it is a big deal to AI enthusiasts, CEOs, investors, researchers, and anyone who actively works or is invested in the AI industry. The leaderboard has retained that significance despite being criticized extensively over time for the reasons I laid out above.
The stakes of the AI race and who will win it are objectively very high, in terms of both the money that's being poured into this space and the amount of time and energy people are spending on winning it, and Chatbot Arena, while flawed, is one of the few places that's keeping score.

"A meaningful benchmark demonstrates the relative merits of new research ideas over existing ones, and thereby heavily influences research directions, funding decisions, and, ultimately, the shape of progress in our field," the researchers write in their paper, titled "The Leaderboard Illusion." "The recent meteoric rise of generative AI models (in terms of public attention, commercial adoption, and the scale of compute and funding involved) has substantially increased the stakes and pressure placed on leaderboards."

The way that Chatbot Arena works is that anyone can go to its site and type in a prompt or question. That prompt is then given to two anonymous models. The user can't see what the models are, but in theory one model could be ChatGPT while the other is Anthropic's Claude. The user is then presented with the output from each of these models and votes for the one they think did a better job. Multiply this process by millions of votes and that's how Chatbot Arena determines who is placed where on the leaderboard. DeepSeek, the Chinese AI model that rocked the industry when it was released in January, is currently ranked #7 on the leaderboard, and its high score was part of the reason people were so impressed.

According to the researchers' paper, the biggest problem with this method is that Chatbot Arena is allowing the biggest companies in this space, namely Google, Meta, Amazon, and OpenAI, to run undisclosed private testing and cherry-pick their best model. The researchers said their systematic review of Chatbot Arena involved combining data sources encompassing 2 million battles, auditing 42 providers and 243 models between January 2024 and April 2025.

"This comprehensive analysis reveals that over an extended period, a handful of preferred providers have been granted disproportionate access to data and testing," the researchers wrote. "In particular, we identify an undisclosed Chatbot Arena policy that allows a small group of preferred model providers to test many model variants in private before releasing only the best-performing checkpoint."

Basically, the researchers claim that companies test their LLMs on Chatbot Arena to find which versions score best, without those tests counting towards their public score. Then they pick the best-scoring version to submit for official ranking.

Chatbot Arena says the researchers' framing here is misleading. "We designed our policy to prevent model providers from just reporting the highest score they received during testing. We only publish the score for the model they release publicly," it said on X.

"In a single month, we observe as many as 27 models from Meta being tested privately on Chatbot Arena in the lead-up to the Llama 4 release," the researchers said. "Notably, we find that Chatbot Arena does not require all submitted models to be made public, and there is no guarantee that the version appearing on the public leaderboard matches the publicly available API."
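To make the selection effect concrete, here is a minimal, hypothetical simulation with made-up parameters; it is not Chatbot Arena's actual scoring code, and a raw win rate stands in for the Arena Score. It shows how a provider that privately tests many statistically identical checkpoints and publishes only the best-looking one ends up reporting an inflated number through sampling luck alone:

```python
# Hypothetical illustration with made-up numbers; not Chatbot Arena's code or data.
# Each checkpoint's measured score is its true win rate plus sampling noise from a
# finite number of pairwise battles. A provider that privately tests N identical
# checkpoints and publishes only the best-looking one reports an inflated score.
import random
import statistics

random.seed(0)

TRUE_WIN_RATE = 0.50       # every checkpoint is genuinely average
BATTLES_PER_TEST = 2_000   # assumed number of votes collected per private test

def measured_win_rate(true_rate: float, n_battles: int) -> float:
    """Estimate a win rate from n_battles simulated human votes."""
    wins = sum(random.random() < true_rate for _ in range(n_battles))
    return wins / n_battles

def published_score(n_private_variants: int) -> float:
    """Privately test several variants, then publish only the best result."""
    return max(measured_win_rate(TRUE_WIN_RATE, BATTLES_PER_TEST)
               for _ in range(n_private_variants))

trials = 500
single = [published_score(1) for _ in range(trials)]       # one submission, no selection
best_of_27 = [published_score(27) for _ in range(trials)]  # 27 private variants, the
                                                           # count the paper reports for Meta
print(f"single checkpoint:   mean win rate {statistics.mean(single):.3f}")
print(f"best of 27 variants: mean win rate {statistics.mean(best_of_27):.3f}")
# Typical result: roughly 0.50 vs roughly 0.52; the same underlying model looks
# measurably better purely because of selection, before any real improvement.
```

That gap would translate into a visibly higher leaderboard position, and the paper's point is that this kind of advantage is only available to providers permitted to run extensive private testing in the first place.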
In early April, when Meta's model Maverick shot up to the second spot on the leaderboard, users were confused because they didn't find it to be that good, or better than other models that ranked below it. As TechCrunch noted at the time, that might be because the version of the model Meta put on Chatbot Arena was optimized for conversationality, and slightly different from the one users had access to.

"We helped Meta with pre-release testing for Llama 4, like we have helped many other model providers in the past," Chatbot Arena said in response to the research paper. "We support open-source development. Our own platform and analysis tools are open source, and we have released millions of open conversations as well. This benefits the whole community."

The researchers also claim that makers of proprietary models, like OpenAI and Google, collect far more data from their testing on Chatbot Arena than fully open-source models do, which allows them to better fine-tune their models to what Chatbot Arena users want.

That last part on its own might be the biggest problem with Chatbot Arena's leaderboard in the long term, since it incentivizes the people who create AI models to design them in a way that scores well on Chatbot Arena, as opposed to what might make them materially better and safer for users in a real-world environment.

As the researchers write: "The over-reliance on a single leaderboard creates a risk that providers may overfit to the aspects of leaderboard performance, without genuinely advancing the technology in meaningful ways. As Goodhart's Law states, when a measure becomes a target, it ceases to be a good measure."

Despite their criticism, the researchers acknowledge the contribution of Chatbot Arena to AI research and the fact that it serves a need, and their paper ends with a list of recommendations on how to make it better, including preventing companies from retracting scores after submission and being more transparent about which models engage in private testing and how much of it they do.

"One might disagree with human preferences (they're subjective) but that's exactly why they matter," Chatbot Arena said on X in response to the paper. "Understanding subjective preference is essential to evaluating real-world performance, as these models are used by people. That's why we're working on statistical methods, like style and sentiment control, to decompose human preference into its constituent parts. We are also strengthening our user base to include more diversity. And if pre-release testing and data helps models optimize for millions of people's preferences, that's a positive thing!"

"If a model provider chooses to submit more tests than another model provider, this does not mean the second model provider is treated unfairly," it added. "Every model provider makes different choices about how to use and value human preferences."
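As for the "style and sentiment control" Chatbot Arena mentions above, the rough idea is to model each vote as a function of both which models were compared and stylistic differences between their answers. The sketch below is hypothetical and not LMArena's implementation; the battle data is made up, and the single style feature (a response-length difference) is an assumption chosen for illustration:

```python
# Hypothetical sketch, not LMArena's implementation. Fits a Bradley-Terry-style
# rating to pairwise votes via logistic regression, with a "style" covariate so
# that preference for a model can be separated from preference for a style of answer.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up battles: (model A index, model B index, 1 if A won else 0,
#                   style difference between A's and B's responses).
battles = [(0, 1, 1, 0.3), (1, 2, 0, -0.1), (0, 2, 1, 0.5),
           (2, 1, 1, -0.2), (0, 1, 1, 0.4), (2, 0, 0, 0.1)]
n_models = 3

X, y = [], []
for a, b, a_won, style_diff in battles:
    row = np.zeros(n_models + 1)
    row[a], row[b] = 1.0, -1.0   # rating difference enters the logit
    row[-1] = style_diff         # assumed style feature (e.g. length difference)
    X.append(row)
    y.append(a_won)

fit = LogisticRegression(fit_intercept=False).fit(np.array(X), np.array(y))
ratings, style_coef = fit.coef_[0][:n_models], fit.coef_[0][-1]
print("style-adjusted ratings:", np.round(ratings, 2))
print("style effect:", round(style_coef, 2))
```

In a setup like this, the per-model coefficients act as style-adjusted ratings, while the style coefficient captures how much voters simply favor, for example, longer answers.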