You now understand Elo for games between two players. LMSYS Chatbot Arena applies the same idea to rank LLMs — with some clever adaptations.
How a "game" works in Chatbot Arena:
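In chess, draws are common and scored as 0.5. Chatbot Arena does the same: a tie gives S = 0.5 to both models. When the two models are closely rated, the expected score is already near 0.5, so a tie barely moves either rating. When the ratings are far apart, a tie nudges the lower-rated model up and the higher-rated model down.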
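To see how a tie plays out numerically, here is a minimal sketch of the standard Elo update with S = 0.5. The function names, the example ratings, and K = 32 are illustrative assumptions, not values taken from Chatbot Arena.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score for A against B (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, s_a, k=32.0):
    """Return the new (r_a, r_b) after one game.

    s_a is A's score: 1.0 for a win, 0.5 for a tie, 0.0 for a loss.
    k=32 is an assumed K-factor chosen only for illustration.
    """
    delta = k * (s_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


# A tie between closely rated models: ratings move by less than one point.
close = elo_update(1200, 1210, 0.5)

# A tie between models 200 points apart: the underdog gains several points.
far = elo_update(1200, 1400, 0.5)

print(close)
print(far)
```

The update is zero-sum: whatever one model gains, the other loses, so the total of the two ratings stays the same.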
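In chess, the result is settled by the two players on the board. In the Arena, each "game" is decided by a different human judge with different preferences. This makes any single vote noisy, but the noise tends to average out over thousands of votes: by the law of large numbers, the observed win rates converge toward each model's average win rate across the voter population.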
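A rough way to quantify this averaging out is the standard error of an observed win rate. The sketch below assumes votes are independent and that one model wins each vote with a fixed probability p, which is a simplification of real Arena traffic; the function name is hypothetical.

```python
import math


def win_rate_standard_error(p, n):
    """Standard error of a win rate estimated from n independent votes.

    Assumes each vote is an independent coin flip with win probability p.
    """
    return math.sqrt(p * (1.0 - p) / n)


# The uncertainty shrinks with the square root of the number of votes.
for n in (100, 1_000, 10_000):
    print(n, round(win_rate_standard_error(0.5, n), 4))
```

Under these assumptions, going from 100 to 10,000 votes cuts the uncertainty by a factor of ten, which is why a few hundred votes are rarely enough to separate two close models.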
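Because individual votes are noisy, Chatbot Arena doesn't just report a point rating. The Arena team resamples the vote data with replacement (bootstrapping) and recomputes the ratings on each resample to obtain confidence intervals. If Model A is rated 1250 ± 15 and Model B is rated 1245 ± 12, the intervals (1235 to 1265 and 1233 to 1257) overlap, so you can't confidently say A is better.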
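The bootstrapping idea can be sketched in a few lines. This is a simplified illustration, not the Arena's actual pipeline: the votes are synthetic, the ratings come from a plain sequential Elo pass, and K = 4, the base rating of 1000, the 200 rounds, and all function names are assumptions.

```python
import random


def expected_score(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def compute_elo(votes, k=4.0, base=1000.0):
    """Sequential Elo over (model_a, model_b, score_a) votes. k and base are assumed."""
    ratings = {}
    for a, b, s_a in votes:
        r_a = ratings.get(a, base)
        r_b = ratings.get(b, base)
        delta = k * (s_a - expected_score(r_a, r_b))
        ratings[a] = r_a + delta
        ratings[b] = r_b - delta
    return ratings


def bootstrap_intervals(votes, rounds=200, seed=0):
    """Approximate 95% percentile interval per model from resampled votes."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(rounds):
        # Draw len(votes) votes with replacement, then recompute all ratings.
        resampled = rng.choices(votes, k=len(votes))
        for model, rating in compute_elo(resampled).items():
            samples.setdefault(model, []).append(rating)
    intervals = {}
    for model, values in samples.items():
        values.sort()
        low = values[int(0.025 * (len(values) - 1))]
        high = values[int(0.975 * (len(values) - 1))]
        intervals[model] = (low, high)
    return intervals


def intervals_overlap(first, second):
    """True if two (low, high) intervals share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]


# Synthetic data: model "A" beats model "B" about 52% of the time.
rng = random.Random(42)
votes = [("A", "B", 1.0 if rng.random() < 0.52 else 0.0) for _ in range(500)]
intervals = bootstrap_intervals(votes)
print(intervals)

# The example from the text: 1250 ± 15 versus 1245 ± 12.
print(intervals_overlap((1235, 1265), (1233, 1257)))  # True
```

Checking for overlap is a quick heuristic rather than a formal significance test, but it captures the main point: a 5-point gap means little when each rating is uncertain by more than 10 points.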
A model might be great at coding but mediocre at creative writing. The Arena computes separate Elo scores per category (coding, math, creative, etc.) in addition to the overall score.
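Per-category ratings amount to running the same rating computation on a filtered subset of the votes. A minimal sketch, assuming each vote carries a category label; the tuple layout, the toy votes, K = 4, and the function names are illustrative assumptions rather than the Arena's data format.

```python
from collections import defaultdict


def expected_score(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def compute_elo(votes, k=4.0, base=1000.0):
    """Sequential Elo over (model_a, model_b, score_a) votes. k and base are assumed."""
    ratings = {}
    for a, b, s_a in votes:
        r_a = ratings.get(a, base)
        r_b = ratings.get(b, base)
        delta = k * (s_a - expected_score(r_a, r_b))
        ratings[a] = r_a + delta
        ratings[b] = r_b - delta
    return ratings


def ratings_by_category(votes):
    """Compute one rating table per category, plus an 'overall' table."""
    grouped = defaultdict(list)
    for category, a, b, s_a in votes:
        grouped[category].append((a, b, s_a))
        grouped["overall"].append((a, b, s_a))
    return {category: compute_elo(subset) for category, subset in grouped.items()}


# Toy votes: A is stronger at coding, B is stronger at creative writing.
votes = [
    ("coding", "A", "B", 1.0),
    ("coding", "A", "B", 1.0),
    ("creative", "A", "B", 0.0),
    ("creative", "A", "B", 0.5),
]
table = ratings_by_category(votes)
for category, ratings in table.items():
    print(category, ratings)
```

In this toy data, A leads in coding and overall while B leads in creative writing, which is exactly the kind of difference a single overall score would hide.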
Source: Chiang et al., "Chatbot Arena" (2024); Yi Zhu, "Chatbot Arena and Elo - Part 1".
Why does Chatbot Arena keep the model identities hidden from judges?
Model X is rated 1300 ± 20 and Model Y is rated 1295 ± 18. Can you confidently say X is better?
Visit arena.ai and try voting yourself. After 5–10 votes, you'll viscerally understand how "which is better?" is all you need — no rubric, no criteria, just preference.