Lesson 4: Elo for LLMs — Chatbot Arena

You now understand Elo for games between two players. LMSYS Chatbot Arena applies the same idea to rank LLMs — with some clever adaptations.

The Setup

How a "game" works in Chatbot Arena:

  1. A user submits a prompt
  2. Two anonymous models respond (the user doesn't know which is which)
  3. The user picks: Model A wins, Model B wins, or Tie
  4. Elo ratings for both models are updated based on the outcome

Why this works: Just like in ping pong, you never define "what makes a good response" on an absolute scale. You only ask: "which of these two do you prefer?" Elo converts thousands of these pairwise preferences into a global ranking.
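The four steps above can be sketched in a few lines of Python. This is a minimal illustration of the textbook Elo update, not the Arena's production code: the model names, the starting rating of 1000, and the step size k = 32 are all assumptions chosen for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, outcome, k=32.0):
    """Return the updated (rating_a, rating_b) after one vote.

    outcome is "a" (Model A wins), "b" (Model B wins), or "tie".
    k is an assumed step size, not the Arena's actual setting.
    """
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[outcome]
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating pool stays constant.
    return rating_a + delta, rating_b - delta


# Hypothetical models and votes, for illustration only.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [
    ("model_x", "model_y", "a"),
    ("model_x", "model_y", "tie"),
    ("model_x", "model_y", "b"),
]
for a, b, outcome in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], outcome)

print(ratings)
```

Note that the user's prompt and the responses never enter the calculation; only the identity of the two models and the vote do.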

Chess vs. Chatbot Arena

Chess Elo
Chatbot Arena Elo

Key Adaptations

1. Handling Ties

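In chess, draws are common and scored as 0.5. Chatbot Arena does the same: a tie gives S = 0.5 to both models. When the two models are closely rated, the expected score is already near 0.5, so a tie barely moves either rating. When they are far apart, a tie pulls the higher-rated model down and the lower-rated model up.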
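To see how much a tie moves a rating, here is a small sketch using the standard Elo expected-score formula. The ratings and the step size k = 32 are illustrative assumptions, and `tie_delta` is a hypothetical helper name.

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def tie_delta(rating_a, rating_b, k=32.0):
    """Rating change for Model A when the vote is a tie (S = 0.5)."""
    return k * (0.5 - expected_score(rating_a, rating_b))


# Equal ratings: the tie matches the expectation exactly, so nothing moves.
close = tie_delta(1200.0, 1200.0)

# A 200-point favorite held to a tie loses roughly 8.3 points at k = 32,
# and the underdog gains the same amount.
favorite = tie_delta(1400.0, 1200.0)
underdog = tie_delta(1200.0, 1400.0)

print(close, round(favorite, 2), round(underdog, 2))
```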

2. Many Judges, Not One

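In chess, the outcome is settled by the two players at the board. In the Arena, each "game" is decided by a different human judge with different preferences. This adds noise to any single vote, but the noise averages out over thousands of votes: by the law of large numbers, the observed win rate between two models approaches the underlying preference rate of the voter population.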
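A toy simulation makes the averaging effect concrete. It assumes, purely for illustration, that each judge independently prefers Model A with a fixed probability of 0.60 and that there are no ties; real voters are neither identical nor independent, so treat this as a sketch of the statistical idea only.

```python
import math
import random


def simulate_win_rate(true_pref, n_votes, seed=0):
    """Fraction of n_votes simulated judges who prefer Model A."""
    rng = random.Random(seed)
    wins = sum(1 for _ in range(n_votes) if rng.random() < true_pref)
    return wins / n_votes


def implied_elo_gap(win_rate):
    """Rating gap at which the Elo curve predicts this win rate (0 < win_rate < 1)."""
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


TRUE_PREF = 0.60  # assumed underlying preference for Model A

# Small samples wander; large samples settle near the true preference rate.
for n in (10, 100, 1000, 20000):
    print(n, round(simulate_win_rate(TRUE_PREF, n), 3))

# A 60% preference corresponds to a gap of roughly 70 Elo points.
print(round(implied_elo_gap(TRUE_PREF), 1))
```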

3. Bootstrap Confidence Intervals

Because individual votes are noisy, Chatbot Arena doesn't just report a point rating. It resamples the vote data with replacement (bootstrapping) and recomputes the ratings on each resample to obtain confidence intervals. If Model A is rated 1250 ± 15 and Model B is 1245 ± 12, the intervals (1235 to 1265 and 1233 to 1257) overlap heavily, so you can't confidently say A is better.
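The bootstrap idea can be sketched as follows. This is an assumption-laden illustration rather than the Arena's actual pipeline: the votes are synthetic, the ratings come from a simple sequential Elo pass with an assumed k = 4 and base rating of 1000, and the interval is a plain percentile interval over 200 resamples.

```python
import random


def elo_ratings(votes, k=4.0, base=1000.0):
    """One sequential Elo pass over (model_a, model_b, score_a) votes."""
    ratings = {}
    for a, b, score_a in votes:
        ra = ratings.setdefault(a, base)
        rb = ratings.setdefault(b, base)
        exp_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        delta = k * (score_a - exp_a)
        ratings[a] = ra + delta
        ratings[b] = rb - delta
    return ratings


def bootstrap_intervals(votes, rounds=200, level=0.95, seed=0):
    """Percentile confidence interval per model from resampled votes."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(rounds):
        # Draw len(votes) votes with replacement and re-rate from scratch.
        resample = rng.choices(votes, k=len(votes))
        for model, rating in elo_ratings(resample).items():
            samples.setdefault(model, []).append(rating)
    tail = (1.0 - level) / 2.0
    intervals = {}
    for model, values in samples.items():
        values.sort()
        lo = values[int(tail * (len(values) - 1))]
        hi = values[int((1.0 - tail) * (len(values) - 1))]
        intervals[model] = (lo, hi)
    return intervals


def intervals_overlap(x, y):
    """True if two (lo, hi) intervals share any point."""
    return x[0] <= y[1] and y[0] <= x[1]


# Synthetic data: hypothetical model_x is preferred in about 55% of 500 votes.
data_rng = random.Random(42)
votes = [
    ("model_x", "model_y", 1.0 if data_rng.random() < 0.55 else 0.0)
    for _ in range(500)
]
intervals = bootstrap_intervals(votes)
print(intervals)
print(intervals_overlap(intervals["model_x"], intervals["model_y"]))
```

When the two intervals overlap, the vote data alone does not support ranking one model above the other.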

4. Category-Specific Ratings

A model might be great at coding but mediocre at creative writing. The Arena computes separate Elo scores per category (coding, math, creative, etc.) in addition to the overall score.
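One simple way to get per-category scores is to tag each vote with a category and run the same rating procedure on each subset. The sketch below assumes every vote already carries a category label; the labels, model names, votes, and k = 32 are hypothetical, and how the Arena actually assigns prompts to categories is not covered here.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """One sequential Elo pass over (model_a, model_b, score_a) votes."""
    ratings = {}
    for a, b, score_a in votes:
        ra = ratings.setdefault(a, base)
        rb = ratings.setdefault(b, base)
        exp_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        delta = k * (score_a - exp_a)
        ratings[a] = ra + delta
        ratings[b] = rb - delta
    return ratings


# Hypothetical tagged votes: (category, model_a, model_b, score for model_a).
tagged_votes = [
    ("coding", "model_x", "model_y", 1.0),
    ("creative", "model_x", "model_y", 0.0),
    ("coding", "model_x", "model_y", 1.0),
    ("creative", "model_x", "model_y", 0.0),
    ("coding", "model_x", "model_y", 1.0),
]

# Every vote counts toward its own category and toward the overall score.
by_category = defaultdict(list)
for category, a, b, score_a in tagged_votes:
    by_category[category].append((a, b, score_a))
    by_category["overall"].append((a, b, score_a))

category_ratings = {
    category: elo_ratings(votes) for category, votes in by_category.items()
}
for category, ratings in category_ratings.items():
    print(category, {m: round(r, 1) for m, r in ratings.items()})
```

In this toy data model_x leads on coding while model_y leads on creative writing, which a single overall score would hide.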

Source: Chiang et al., "Chatbot Arena" (2024); Yi Zhu, "Chatbot Arena and Elo - Part 1".

Why Not Just Use Benchmarks?

The Elo advantage for LLMs:

The cost: you need many human votes, and results are noisier than a deterministic benchmark.

Quick Check

Why does Chatbot Arena keep the model identities hidden from judges?

Model X is rated 1300 ± 20 and Model Y is rated 1295 ± 18. Can you confidently say X is better?

Recommended Reading

Visit arena.ai and try voting yourself. After 5–10 votes, you'll viscerally understand how "which is better?" is all you need — no rubric, no criteria, just preference.