Lesson 3: Convergence — How Ratings Stabilize

You now know the formula and the update rule. But does the system actually work? Do ratings converge to something meaningful, or do they just bounce around forever?

The Self-Correcting Mechanism

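Built-in negative feedback: If your rating is too high relative to your true skill, you'll lose more often than predicted → your rating drops. If it is too low, you'll win more often than predicted → your rating rises. Individual games still move ratings up and down, but on average each update pushes a rating toward the value at which predicted and actual results match.

This works because the sign of (S - E) tells the update which way to move: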

Overrated player: actual wins < expected wins → S - E is negative on average → rating decreases
Underrated player: actual wins > expected wins → S - E is positive on average → rating increases
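
The sign argument above can be checked numerically. The following is a minimal sketch, assuming the standard Elo expected-score formula and the update R' = R + K(S - E) from the earlier lessons. The function names, the example ratings, and the assumption that the opponent is accurately rated are illustrative, not part of the lesson.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expected score for `rating` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def expected_drift(listed: float, true_skill: float, opponent: float,
                   k: float = 32.0) -> float:
    """Average rating change per game: K * (mean S - E).

    Mean S is driven by the player's true skill; E is what the system
    predicts from the listed rating. Assumes the opponent's rating is accurate.
    """
    mean_s = expected_score(true_skill, opponent)
    e = expected_score(listed, opponent)
    return k * (mean_s - e)


# Overrated: listed 1700, true skill 1500, opponent at 1600.
print(expected_drift(1700, 1500, 1600))  # negative: rating falls on average
# Underrated: listed 1300, true skill 1500, same opponent.
print(expected_drift(1300, 1500, 1600))  # positive: rating rises on average
# Correctly rated: no systematic drift.
print(expected_drift(1500, 1500, 1600))  # 0.0
```

The drift is zero only when the listed rating equals the true skill, which is the equilibrium the feedback loop pulls toward.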

How Many Games to Stabilize?

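This depends on K (larger K moves ratings faster but makes them noisier) and on how varied the opponents in the rating pool are.

Rule of thumb: With K = 32, a player's rating typically stabilizes within 20–30 games against opponents of varied strength. FIDE follows a similar logic: a player new to the rating list gets a higher K of 40 until they have completed 30 rated games.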

For a small ping pong ladder (8–12 people), expect ratings to feel "right" after everyone has played roughly 15–20 matches each.
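
To get a feel for how K affects the number of games needed, here is a rough sketch that follows the average update K(S - E) instead of simulating random results. It assumes the standard Elo expected-score formula; `games_to_settle`, the 50-point tolerance, and the opponent ratings are hypothetical choices for illustration, and real game-by-game results would add noise on top of this trend.

```python
def expected_score(rating, opponent):
    """Standard Elo expected score for `rating` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def games_to_settle(start, true_skill, opponents, k=32.0,
                    tolerance=50.0, max_games=1000):
    """Count games until the rating is within `tolerance` of true skill.

    Each step applies the average update K * (mean S - E), so this shows
    the trend of convergence, not the game-to-game randomness.
    """
    rating = start
    for game in range(max_games):
        if abs(rating - true_skill) <= tolerance:
            return game
        opponent = opponents[game % len(opponents)]  # cycle through the pool
        mean_s = expected_score(true_skill, opponent)
        rating += k * (mean_s - expected_score(rating, opponent))
    return max_games


ladder = [1400, 1500, 1600, 1700, 1800]  # hypothetical opponent ratings
for k in (16, 32, 64):
    # A player who starts at 1500 but really plays at 1700.
    print(k, games_to_settle(1500, 1700, ladder, k=k))
```

In this idealized model a larger K always settles in fewer games. With real, noisy results a larger K also makes the rating jump around more once it has arrived, which is the trade-off behind the rule of thumb.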

The Starting Rating Problem

Everyone has to start somewhere. Common approaches:

Fixed start (e.g., everyone begins at 1500). Simple, but early games produce wild swings as the system figures out who's actually good.
Provisional period with high K. New players use K = 40–64 for their first N games, then drop to K = 32. This lets them find their level quickly, and because each opponent still updates with their own lower K, established ratings are not thrown off as much by a newcomer's early results.
Placement matches. Play several games before assigning a rating. Chatbot Arena does this — a new model is matched broadly at first to quickly estimate its strength.

For your ping pong ladder: Start everyone at 1500 with K = 40 for the first 10 games, then switch to K = 32. Ratings will feel reasonable after 2–3 weeks of regular play.
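
The ladder recommendation above could be sketched as follows. This assumes the standard Elo update from the earlier lessons; the constant and function names are hypothetical, and counting `games_played` as the number of games finished before the current one is an assumption about how the cutoff is applied.

```python
PROVISIONAL_GAMES = 10  # first 10 games use the higher K
PROVISIONAL_K = 40
REGULAR_K = 32


def k_factor(games_played: int) -> int:
    """K for a player who has already completed `games_played` games."""
    return PROVISIONAL_K if games_played < PROVISIONAL_GAMES else REGULAR_K


def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expected score for `rating` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def update(rating: float, opponent: float, score: float, games_played: int) -> float:
    """New rating after one game. score: 1 for a win, 0.5 draw, 0 loss."""
    k = k_factor(games_played)
    return rating + k * (score - expected_score(rating, opponent))


# A win against an equally rated opponent:
print(update(1500, 1500, 1, games_played=0))   # 1520.0 (provisional, K = 40)
print(update(1500, 1500, 1, games_played=10))  # 1516.0 (regular, K = 32)
```

Each player is updated with their own K, so when a provisional player meets an established one, the points gained and lost are not equal and the pool total can shift slightly.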

When Elo Doesn't Stabilize Well

Watch out for these failure modes:

Too few games: the ratings have not seen enough results to converge. Common in small pools and for players who rarely play.
Non-transitive matchups: A beats B, B beats C, C beats A (like rock-paper-scissors). Elo places everyone on a single linear skill scale, so it cannot represent a cycle and rates all three players about the same.
Skill changes over time: if players improve rapidly, a low K won't keep up. This is why FIDE uses a higher K (40) for juniors under 18 rated below 2300.
Rating inflation/deflation: points are only exchanged between players, so the pool average shifts when people join or leave. If new players always start at 1500, lose, and then quit, the points they "donated" stay behind and inflate everyone else. If players leave with more points than they started with, the pool deflates.
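
The non-transitive case is easy to reproduce. The sketch below is an illustration under the standard Elo formula with the same K for everyone; the three players, the fixed match order, and the function names are hypothetical. It plays the rock-paper-scissors cycle repeatedly and prints the resulting ratings along with the pool total.

```python
def expected_score(rating, opponent):
    """Standard Elo expected score for `rating` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def play(ratings, winner, loser, k=32.0):
    """Apply one decisive game in place, using the same K for both players."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta  # equal K: the loser gives up exactly what the winner gains


def run_cycles(n_cycles, k=32.0):
    """A always beats B, B always beats C, C always beats A."""
    ratings = {"A": 1500.0, "B": 1500.0, "C": 1500.0}
    for _ in range(n_cycles):
        play(ratings, "A", "B", k)
        play(ratings, "B", "C", k)
        play(ratings, "C", "A", k)
    return ratings


final = run_cycles(100)
print({name: round(r, 1) for name, r in final.items()})
print(round(sum(final.values()), 6))  # pool total is unchanged when both sides share K
```

Even though every result is perfectly predictable, all three ratings stay close to 1500, so Elo predicts roughly 50% for each pairing. The unchanged total also shows why inflation and deflation need players entering or leaving (or unequal K values) to occur.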

Source: Wikipedia — Ratings inflation; Chiang et al. (2024) §3 on convergence in Chatbot Arena.

Quick Check

A player's rating is much higher than their true skill. Over many games, what happens?