Arena's 60 Million Monthly Conversations Have OpenAI, Google, and Anthropic Paying Attention

The UC Berkeley spinout has turned its crowdsourced AI leaderboard into a $30 million ARR enterprise business in four months.

In the race to build the best large language model, the most important judge might not be a PhD with a benchmark suite. It’s a spreadsheet user in São Paulo or a student in Seoul. Arena, which began as a side project for two UC Berkeley PhD students, has built a platform where millions of these users vote on AI model outputs in head-to-head battles. The result is a massive, real-time dataset of human preferences that has become a critical signal for the labs building the models themselves. Now, Arena is proving those same preferences are worth $30 million a year to the companies that need to evaluate them [Founded.com, 2026].

From Chatbot Arena to a $1.7 Billion Bet

The company’s origin is a classic dorm-room story with an academic pedigree. Roommates Anastasios Angelopoulos and Wei-Lin Chiang, both PhD students, launched Chatbot Arena in 2023 to cut through the marketing noise around AI models. The premise was simple: users submit a prompt, two anonymized models generate responses, and the user votes for the better one. This created a continuously updated leaderboard based not on synthetic test scores but on perceived helpfulness and clarity. The project quickly gained traction, processing tens of millions of conversations. In 2025, they formally spun the project out as a company, bringing on UC Berkeley professor Ion Stoica as a co-founder [Founded.com, 2026].
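
The leaderboard math behind this kind of pairwise voting is typically an Elo- or Bradley-Terry-style rating. As a minimal sketch of the idea, not Arena's exact production method, here is an online Elo update over a stream of blind votes; the model names and K-factor are illustrative:

```python
from collections import defaultdict

def update_elo(ratings, model_a, model_b, winner, k=4.0):
    """Update two models' ratings after one blind head-to-head vote."""
    ra, rb = ratings[model_a], ratings[model_b]
    # Expected score for model_a under the standard logistic Elo curve.
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    score_a = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[winner]
    ratings[model_a] = ra + k * (score_a - expected_a)
    ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))

# Every model starts at the same baseline; ratings drift apart as votes arrive.
ratings = defaultdict(lambda: 1000.0)
votes = [("model-alpha", "model-beta", "model_a"),
         ("model-alpha", "model-beta", "tie")]
for a, b, w in votes:
    update_elo(ratings, a, b, w)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

Because each vote only nudges two ratings, a scheme like this scales to millions of votes and yields a leaderboard that updates continuously rather than in benchmark-release cycles.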

The commercial leap came later that year with the launch of AI Evaluations, a paid service where enterprises can run their own proprietary prompts and use cases through Arena’s crowdsourced evaluation gauntlet. The bet was that the same mechanism that informed the public leaderboard could provide actionable, unbiased performance data for internal model development and procurement. The market response was immediate. Within four months of launch, the service reached an annualized revenue run rate of $30 million [X @ml_angelopoulos, 2026]. This traction fueled a staggering funding sprint: a $100 million seed round, followed just months later by a $150 million Series A at a $1.7 billion valuation [TechCrunch, Jan 2026].

The Technical Wedge: A Preference Dataset You Can't Game

The core of Arena’s defensibility is methodological. Traditional benchmarks invite "benchmark gaming": a model can be fine-tuned to excel at a specific test like MMLU or GSM8K without necessarily improving its general conversational utility. Arena’s crowdsourced approach sidesteps this by using a dynamic, diverse set of real-world prompts and relying on human judgment for scoring. This produces a different kind of signal: one of practical user preference.

From an infrastructure perspective, scaling this operation is its own engineering challenge. The platform now handles approximately 60 million conversations per month from over 5 million monthly users [Founded.com, 2026]. Managing this volume while ensuring low-latency model inference, maintaining blind evaluation integrity, and preventing systematic voting bias requires a robust backend. The company has grown its team to over 40 people to tackle these problems [X @ml_angelopoulos, 2026].
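
One concrete example of the bias problem: if voters always see responses in the same order, position preference can masquerade as model preference. A minimal sketch of the two standard countermeasures, randomized side assignment plus a per-voter audit, follows; the thresholds and field names are assumptions, not Arena internals:

```python
import random

def assign_sides(model_a, model_b):
    """Randomize presentation order so screen position carries no model signal."""
    pair = [model_a, model_b]
    random.shuffle(pair)
    return pair  # [left, right]

def left_pick_rate(votes):
    """Fraction of a voter's decisive votes that went to the left response."""
    decisive = [v for v in votes if v in ("left", "right")]
    return sum(v == "left" for v in decisive) / len(decisive) if decisive else 0.5

# With randomized sides, honest voters should pick "left" about half the time;
# a strong deviation over a meaningful sample suggests position bias.
history = {"voter_17": ["left"] * 45 + ["right"] * 5,
           "voter_42": ["left", "right", "tie"] * 20}
flagged = {v for v, vs in history.items()
           if len(vs) >= 30 and abs(left_pick_rate(vs) - 0.5) > 0.3}
print(assign_sides("model-alpha", "model-beta"), flagged)
```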

The commercial product, AI Evaluations, layers enterprise requirements on top of this foundation. It allows clients to do the following (a hypothetical configuration sketch follows the list):

  • Define custom evaluation suites using their own internal prompts and documents.
  • Use the existing user pool as a scalable, on-demand evaluation workforce.
  • Receive detailed comparative reports that go beyond a simple ranking to show where specific models excel or fail.
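
Arena has not published the AI Evaluations API, so the suite definition below is purely a hypothetical stand-in; every field name is an assumption made to show the shape of the product:

```python
from dataclasses import dataclass

# Hypothetical client-side evaluation suite; not Arena's actual API.
@dataclass
class EvalSuite:
    name: str
    models: list[str]            # candidate models to compare head-to-head
    prompts: list[str]           # the client's proprietary prompts
    votes_per_pair: int = 200    # human votes collected per model matchup
    rubric: str = "helpfulness"  # the quality voters are asked to judge

suite = EvalSuite(
    name="contract-review-q1",
    models=["model-alpha", "model-beta"],
    prompts=["Summarize the indemnification clause in: ..."],
)
# Conceptually: each prompt runs through every model pair, responses are
# shown blind to voters, and the report aggregates wins/losses/ties per model.
print(suite)
```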

The early adopters validating this approach are notable. Arena has partnered with OpenAI, Google, and Anthropic to feature their flagship models on the public leaderboard [TechCrunch, Apr 2025]. While these are not necessarily paying customers, the partnerships signal deep industry engagement and have likely paved the way for the paid enterprise service.

Funding Round | Amount | Valuation | Key Notes
Seed (2025) | $100M | ~$600M | Spinout from UC Berkeley research project [Founded.com, 2026].
Series A (Jan 2026) | $150M | $1.7B | Closed months after the seed; followed rapid ARR growth to $30M [TechCrunch, Jan 2026].

The Scale Test: Risks in the Crowd

For all its momentum, Arena’s model introduces unique scaling risks. The platform’s value is directly tied to the quality and representativeness of its crowd. As evaluation workloads grow, especially for lucrative enterprise contracts, maintaining a sufficiently large and demographically balanced pool of voters becomes a critical operational task. Bias in the crowd could skew results, and voter fatigue could degrade response quality.

Furthermore, the business faces the inherent tension of being both judge and paid service provider. Its most prominent partners and likely its largest customers are the very AI labs whose models it ranks. Maintaining perceived and actual independence is paramount. A misstep here could undermine the trust that is the company’s primary asset. Established evaluation firms and internal benchmarking teams at large tech companies represent the incumbent alternatives.

The Next Twelve Months

The immediate roadmap is clear: execute on the enterprise sales motion that has already shown remarkable early velocity. The company will need to transition from landing initial contracts to building a repeatable sales process and expanding within existing accounts. Key milestones to watch will be the announcement of named enterprise customers beyond the AI labs and any expansion into adjacent evaluation verticals, such as code generation or multimodal AI.

The technical breakdown is straightforward but non-trivial. Arena’s system is a distributed human-in-the-loop pipeline operating at massive scale. The key engineering challenges are data pipeline reliability, voter quality assurance, and cost-effective inference routing to various model APIs. The real moat is the accumulated dataset of hundreds of millions of preference votes, a corpus that is expensive and slow to replicate.
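
As a rough illustration of the routing challenge, the sketch below fans one battle's prompt out to two provider endpoints with retries and a simple cost estimate. The provider table, prices, and response stubs are placeholders, not Arena's actual stack:

```python
import time

# Placeholder provider table: per-call cost plus a stubbed generation function.
PROVIDERS = {
    "model-alpha": {"cost_per_call": 0.002, "call": lambda p: "alpha: " + p[:20]},
    "model-beta":  {"cost_per_call": 0.004, "call": lambda p: "beta: " + p[:20]},
}

def generate_with_retry(model, prompt, retries=2, backoff=0.5):
    """Call a model endpoint, retrying transient failures with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return PROVIDERS[model]["call"](prompt)
        except Exception:
            if attempt == retries:
                raise
            time.sleep(backoff * (2 ** attempt))

def run_battle(model_a, model_b, prompt):
    """Produce the two anonymized responses for one head-to-head vote."""
    return {
        "responses": {m: generate_with_retry(m, prompt) for m in (model_a, model_b)},
        "est_cost": sum(PROVIDERS[m]["cost_per_call"] for m in (model_a, model_b)),
    }

print(run_battle("model-alpha", "model-beta", "Compare these two contracts ..."))
```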

The sober assessment for scale is that the system’s greatest strength is also its most complex dependency. The crowd is not a software component that can be simply upgraded. Its management is a continuous exercise in incentive design, quality control, and scale economics. If the growth in paid evaluation volume outpaces the growth and retention of a high-quality voter base, the integrity of the core product could erode. For now, the market is voting with its dollars on the bet that Arena can manage it.

Sources

  1. [Founded.com, 2026] How two Berkeley roommates built a $1.7B startup that helps you rank AI models | https://www.founded.com/lmarena-arena-ai-ranking-tool-startup-founders/
  2. [TechCrunch, Apr 2025] AI benchmarking platform Chatbot Arena forms a new company | https://techcrunch.com/2025/04/17/ai-benchmarking-platform-chatbot-arena-forms-a-new-company/
  3. [TechCrunch, Jan 2026] LMArena lands $1.7B valuation four months after launching its product | https://techcrunch.com/2026/01/06/lmarena-lands-1-7b-valuation-four-months-after-launching-its-product/
  4. [X @ml_angelopoulos, 2026] Post on user and revenue metrics
  5. [Bloomberg, June 2025] Video interview with LMArena co-founders | https://www.bloomberg.com/news/videos/2025-06-05/lmarena-co-founders-on-the-future-of-ai-rankings-video
  6. [Business Insider, September 2025] The battle of the LLMs | https://www.businessinsider.com/lmarena-cto-compare-ai-models-google-nano-banana-2025-9
