Lesson 4: Elo for LLMs — Chatbot Arena

You now understand Elo for games between two players. LMSYS Chatbot Arena applies the same idea to rank LLMs — with some clever adaptations.

The Setup

How a "game" works in Chatbot Arena:

A user submits a prompt
Two anonymous models respond (the user doesn't know which is which)
The user picks: Model A wins, Model B wins, or Tie
Elo ratings for both models are updated based on the outcome

Why this works: Just like in ping pong, you never define "what makes a good response" on an absolute scale. You only ask: "which of these two do you prefer?" Elo converts thousands of these pairwise preferences into a global ranking.
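The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the Arena's actual pipeline: the step size K = 32, the starting rating of 1000, the model names, and the votes are all assumptions made up for the example.

```python
from collections import defaultdict

K = 32           # assumed update step size (illustrative choice)
START = 1000.0   # assumed starting rating for every model

def expected_score(r_a, r_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings, model_a, model_b, outcome):
    """Apply one vote. outcome is 'a' (A wins), 'b' (B wins), or 'tie'."""
    s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[outcome]
    e_a = expected_score(ratings[model_a], ratings[model_b])
    delta = K * (s_a - e_a)
    ratings[model_a] += delta   # A gains what B loses, so the total is conserved
    ratings[model_b] -= delta

def rank(votes):
    """Turn a list of pairwise votes into a ranking, best model first."""
    ratings = defaultdict(lambda: START)
    for model_a, model_b, outcome in votes:
        update(ratings, model_a, model_b, outcome)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical votes: (model shown as A, model shown as B, user's choice)
votes = [
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "a"),
    ("model-x", "model-z", "a"),
    ("model-y", "model-x", "tie"),
]

for name, rating in rank(votes):
    print(f"{name}: {rating:.1f}")
```

Note that no vote ever scores a response on its own. The ranking emerges only from who was preferred over whom.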

Chess vs. Chatbot Arena

Chess Elo

Two humans play
Objective outcome (win/loss/draw)
Same player, many games
Skill changes slowly

Chatbot Arena Elo

Two LLMs respond
Subjective preference (human judge)
Many different judges
Models are fixed (until updated)

Key Adaptations

1. Handling Ties

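In chess, draws are common and scored as 0.5. Chatbot Arena does the same: a tie gives S = 0.5 to both models. When the two models are closely rated, the expected score is already near 0.5, so a tie barely moves either rating. When the ratings are far apart, a tie pulls the higher-rated model down and the lower-rated model up, because the favorite was expected to win.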
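To see how much a tie moves a rating, here is a small sketch using the standard Elo update with S = 0.5. The step size K = 32 and the example ratings are assumptions chosen for illustration.

```python
K = 32  # assumed update step size

def expected_score(r_a, r_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def tie_delta(r_a, r_b, k=K):
    """Rating change for model A when the vote is a tie (S = 0.5)."""
    return k * (0.5 - expected_score(r_a, r_b))

# Equal ratings: a tie is exactly what was expected, so nothing changes.
print(tie_delta(1200, 1200))

# A 100-point favorite that only ties loses a few points,
# and the underdog gains the same amount.
print(round(tie_delta(1300, 1200), 2))
print(round(tie_delta(1200, 1300), 2))
```

The size of the shift depends only on the rating gap and K, so a tie carries more information the more lopsided the matchup was expected to be.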

2. Many Judges, Not One

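In chess, the players themselves decide the outcome over the board, so no judge is involved. In the Arena, each "game" is decided by a different human judge with different preferences. This adds noise to individual votes, but random disagreement between judges tends to average out over thousands of votes. Averaging does not remove systematic bias: if most voters share a preference (for example, for longer answers), the ratings will reflect it.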
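The averaging effect can be illustrated with a toy simulation. It assumes each vote is an independent coin flip in which Model A is preferred with a fixed probability of 0.6, which is a simplification made up for this sketch. Real judges are not independent or identically distributed.

```python
import random

def observed_win_rate(true_p, n_votes, rng):
    """Fraction of n_votes won by A, where each vote is an independent
    coin flip that favors A with probability true_p."""
    wins = sum(rng.random() < true_p for _ in range(n_votes))
    return wins / n_votes

rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (10, 100, 1000, 10000):
    rate = observed_win_rate(0.6, n, rng)
    print(f"{n:>6} votes: observed win rate {rate:.3f}")
```

With few votes the observed rate can land far from 0.6. With many votes it settles close to it. The simulation models random noise only, so it says nothing about biases shared by most voters.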

3. Bootstrap Confidence Intervals

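Because individual votes are noisy, Chatbot Arena doesn't just report a point rating. It resamples the vote data with replacement (bootstrapping), recomputes the ratings on each resample, and reports the spread as a confidence interval. If Model A is rated 1250 ± 15 and Model B is 1245 ± 12, the intervals (1235 to 1265 and 1233 to 1257) overlap almost entirely, so you can't confidently say A is better.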
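A rough sketch of the bootstrap idea follows. It is not the Arena's actual code: the sequential Elo fit, K = 4, the 200 resampling rounds, the percentile method, and the synthetic votes are all assumptions chosen to keep the example short.

```python
import random
from collections import defaultdict

K = 4            # assumed small step size, so late votes do not dominate
START = 1000.0   # assumed starting rating

def expected_score(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def fit_elo(votes):
    """Run sequential Elo updates over (model_a, model_b, outcome) votes."""
    ratings = defaultdict(lambda: START)
    for a, b, outcome in votes:
        s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[outcome]
        delta = K * (s_a - expected_score(ratings[a], ratings[b]))
        ratings[a] += delta
        ratings[b] -= delta
    return dict(ratings)

def bootstrap_intervals(votes, n_rounds=200, seed=0):
    """Return {model: (low, high)}, a rough 95% percentile interval."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(n_rounds):
        resampled = rng.choices(votes, k=len(votes))  # sample with replacement
        for name, rating in fit_elo(resampled).items():
            samples[name].append(rating)
    intervals = {}
    for name, values in samples.items():
        values.sort()
        low = values[int(0.025 * (len(values) - 1))]
        high = values[int(0.975 * (len(values) - 1))]
        intervals[name] = (low, high)
    return intervals

def intervals_overlap(first, second):
    """True if two (low, high) intervals share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]

# Synthetic votes, made up for illustration: model-x is preferred ~55% of the time.
data_rng = random.Random(42)
votes = [("model-x", "model-y", "a" if data_rng.random() < 0.55 else "b")
         for _ in range(500)]

for name, (low, high) in bootstrap_intervals(votes).items():
    print(f"{name}: {low:.1f} to {high:.1f}")

# The lesson's example: 1250 ± 15 versus 1245 ± 12.
print(intervals_overlap((1235, 1265), (1233, 1257)))  # True
```

Overlapping intervals are a quick warning sign rather than a formal test. They tell you the vote data cannot yet separate the two models.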

4. Category-Specific Ratings

A model might be great at coding but mediocre at creative writing. The Arena computes separate Elo scores per category (coding, math, creative, etc.) in addition to the overall score.
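One simple way to get per-category scores is to keep a separate rating table for each category and feed every vote into both its own category and an overall table. The sketch below assumes each vote carries a category label. The labels, model names, K, and starting rating are hypothetical.

```python
from collections import defaultdict

K = 32           # assumed update step size
START = 1000.0   # assumed starting rating

def expected_score(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings, model_a, model_b, outcome):
    """Apply one vote ('a', 'b', or 'tie') to a single rating table."""
    s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[outcome]
    delta = K * (s_a - expected_score(ratings[model_a], ratings[model_b]))
    ratings[model_a] += delta
    ratings[model_b] -= delta

def ratings_by_category(votes):
    """votes: (category, model_a, model_b, outcome).
    Returns one rating table per category plus an 'overall' table."""
    tables = defaultdict(lambda: defaultdict(lambda: START))
    for category, model_a, model_b, outcome in votes:
        for key in ("overall", category):
            update(tables[key], model_a, model_b, outcome)
    return tables

# Hypothetical votes: model-x is preferred for coding, model-y for creative writing.
votes = [
    ("coding", "model-x", "model-y", "a"),
    ("coding", "model-y", "model-x", "b"),
    ("creative", "model-x", "model-y", "b"),
]

for category, table in ratings_by_category(votes).items():
    ordered = sorted(table.items(), key=lambda kv: kv[1], reverse=True)
    print(category, [(name, round(rating, 1)) for name, rating in ordered])
```

Each category table sees fewer votes than the overall table, so per-category ratings are noisier and need more data to separate models.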

Why Not Just Use Benchmarks?

The Elo advantage for LLMs:

No fixed rubric needed — you don't have to define "good" in advance
Captures subjective quality — fluency, helpfulness, style — things hard to score automatically
Harder to game, because there is no fixed test set to optimize for
New models slot in quickly — just play them against existing models

The cost: you need many human votes, and results are noisier than a deterministic benchmark.

Quick Check

Why does Chatbot Arena keep the model identities hidden from judges?

Model X is rated 1300 ± 20 and Model Y is rated 1295 ± 18. Can you confidently say X is better?