In Lesson 1 we introduced Elo's formula for predicting wins. But why that formula? Why not some other function of the rating difference? The answer comes from a 1952 statistical model by Ralph Bradley and Milton Terry.
You have N players. You can't measure their skill directly — you can only observe pairwise outcomes. You want a model that:
P(i beats j) = pi / (pi + pj), where pi > 0 is player i's strength. That's it: the simplest possible "ratio model." If Alice has strength 3 and Bob has strength 1, Alice wins 3/(3+1) = 75% of the time.
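As a quick illustration, the ratio model fits in a few lines of Python. This is a minimal sketch: the function name `bt_win_probability` is made up for this lesson and is not taken from any library.

```python
def bt_win_probability(p_i: float, p_j: float) -> float:
    """Bradley-Terry probability that strength p_i beats strength p_j."""
    if p_i <= 0 or p_j <= 0:
        raise ValueError("strengths must be positive")
    return p_i / (p_i + p_j)


# Alice (strength 3) against Bob (strength 1), as in the example above.
alice_beats_bob = bt_win_probability(3.0, 1.0)
bob_beats_alice = bt_win_probability(1.0, 3.0)
print(alice_beats_bob)  # 0.75
```

The two probabilities always sum to 1, so the model leaves no room for draws.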
The Bradley-Terry formula uses raw strength values. Elo uses ratings on a log scale. Here's the connection:
Step 1: Define the rating as the log of strength:
Ri = c × log₁₀(pi)
where c = 400 (Elo's scaling constant).
Step 2: Substitute into Bradley-Terry:
P(i beats j) = pi / (pi + pj)
= 1 / (1 + pj/pi)
= 1 / (1 + 10^((Rj - Ri) / 400))    since pi = 10^(Ri / 400)
Result: Exactly the Elo expected score formula from Lesson 1.
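The substitution can also be checked numerically. The sketch below converts strengths to ratings with c = 400 and compares the two formulas; the helper names (`rating_from_strength`, `elo_expected_score`, `bt_win_probability`) are illustrative choices for this lesson, not a library API.

```python
import math

ELO_SCALE = 400.0  # the constant c from Step 1


def rating_from_strength(p: float) -> float:
    """R = c * log10(p): map a Bradley-Terry strength to an Elo-style rating."""
    return ELO_SCALE * math.log10(p)


def elo_expected_score(r_i: float, r_j: float) -> float:
    """Elo expected score of player i against player j."""
    return 1.0 / (1.0 + 10 ** ((r_j - r_i) / ELO_SCALE))


def bt_win_probability(p_i: float, p_j: float) -> float:
    """Bradley-Terry win probability from raw strengths."""
    return p_i / (p_i + p_j)


p_alice, p_bob = 3.0, 1.0
r_alice = rating_from_strength(p_alice)
r_bob = rating_from_strength(p_bob)  # log10(1) = 0, so Bob's rating is 0

from_elo = elo_expected_score(r_alice, r_bob)
from_bt = bt_win_probability(p_alice, p_bob)
print(from_elo, from_bt)  # the two values agree up to floating-point rounding
```

Only rating differences matter here: adding the same constant to every rating, which corresponds to multiplying every strength by a common factor, leaves each prediction unchanged.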
Imagine each player has a random "performance time" drawn from an exponential distribution. Player i's time Ti has rate pi, so its mean is 1/pi (stronger = faster). The probability that i finishes before j:
P(Ti < Tj) = pi / (pi + pj)
This falls directly out of the math of exponential distributions. It means: if you believe performance is memoryless, the two players' times are independent, and strength scales the rate, Bradley-Terry is the model that follows.
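A small Monte Carlo run makes the "race" picture concrete. The sketch below draws independent exponential times with Python's standard `random` module (`expovariate` takes the rate, so the mean is 1/rate) and counts how often the stronger player finishes first. The seed, the trial count, and the function name `simulate_race` are arbitrary choices for illustration.

```python
import random


def simulate_race(p_i: float, p_j: float, trials: int, seed: int = 0) -> float:
    """Estimate P(T_i < T_j) for independent exponential times with rates p_i, p_j."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    wins = 0
    for _ in range(trials):
        t_i = rng.expovariate(p_i)  # mean time 1 / p_i
        t_j = rng.expovariate(p_j)  # mean time 1 / p_j
        if t_i < t_j:
            wins += 1
    return wins / trials


estimate = simulate_race(3.0, 1.0, trials=100_000)
exact = 3.0 / (3.0 + 1.0)
print(f"simulated: {estimate:.3f}, Bradley-Terry: {exact:.3f}")
```

With this many trials the estimate should land within about a percentage point of the exact value 0.75; the remaining gap is sampling noise.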
The Bradley-Terry model satisfies a key axiom, known as independence of irrelevant alternatives: adding or removing other players from the tournament doesn't change the predicted probability between any pair. The ratio pi/(pi + pj) depends only on i and j, not on who else exists.
This is exactly what you want for a rating system. Alice's chance of beating Bob shouldn't change just because Carol joined the ladder.
Take the log-odds of i beating j:
log( P(i beats j) / P(j beats i) ) = log(pi / pj) = λi − λj
where λ = log(p). The log-odds of winning is simply the difference in log-strengths. This linearity is what makes ratings additive and interpretable: a 200-point gap always means the same thing, regardless of whether we're at 1200 vs 1400 or 2600 vs 2800.
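To see the "same gap, same meaning" property in numbers, the sketch below evaluates the Elo expected score for two 200-point gaps at very different rating levels and converts the result to natural-log odds. The function names are illustrative and the scale of 400 is Elo's constant from earlier in this lesson.

```python
import math

ELO_SCALE = 400.0


def elo_expected_score(r_i: float, r_j: float) -> float:
    """Elo expected score of player i against player j."""
    return 1.0 / (1.0 + 10 ** ((r_j - r_i) / ELO_SCALE))


def log_odds(prob: float) -> float:
    """Natural-log odds of an event with probability prob."""
    return math.log(prob / (1.0 - prob))


club_level = elo_expected_score(1400, 1200)
elite_level = elo_expected_score(2800, 2600)
print(club_level, elite_level)  # both are about 0.76

# With natural logs, the log-odds equal (rating gap) * ln(10) / 400.
predicted_log_odds = 200 * math.log(10) / ELO_SCALE
print(log_odds(club_level), predicted_log_odds)
```

Because the log-odds are linear in the rating gap, doubling the gap doubles the log-odds, not the win probability.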
Sources: Bradley & Terry, "Rank Analysis of Incomplete Block Designs," Biometrika 39(3/4):324–345, 1952. Wikipedia: Bradley-Terry model.
Player A has strength p=5, Player B has strength p=3. What's P(A beats B)?
Why does Elo use ratings (log-scale) instead of raw Bradley-Terry strengths?
What does "independence of irrelevant alternatives" mean for a rating system?
The Wikipedia article on Bradley-Terry gives the formal likelihood function and connects to maximum likelihood estimation. For how this underpins modern LLM evaluation, see §2 of Chiang et al. (2024).
Next: Hands-on — running your own mini Elo tournament with a script (coming soon).
Questions? Ask your Copilot agent — happy to dig into any of the math further.