You now know the formula and the update rule. But does the system actually work? Do ratings converge to something meaningful, or do they just bounce around forever?
The Self-Correcting Mechanism
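Built-in negative feedback: If your rating is too high relative to your true skill, you'll lose more often than predicted → your rating drops. If it is too low, you'll win more often than predicted → your rating rises. On average, the system pushes each rating toward the point where predicted and actual results agree, although individual results still make it fluctuate around that point.
This happens because the sign of (S - E), the actual score minus the expected score, sets the direction of each update: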
Overrated player: actual wins < expected wins → S - E is negative on average → rating decreases
Underrated player: actual wins > expected wins → S - E is positive on average → rating increases
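The two cases above can be checked numerically. The sketch below is a minimal illustration, not part of any particular rating system: it assumes the standard 400-point logistic expected-score formula, and it assumes real results follow that same curve evaluated at the player's true skill. The names `average_drift` and `true_skill` are hypothetical.

```python
def expected_score(rating, opponent):
    """Standard Elo expected score on the 400-point logistic scale."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400))


def average_drift(rating, true_skill, opponent=1500, k=32):
    """Expected rating change per game: K * (mean S - E).

    Assumption: the mean of S is the win probability implied by
    true_skill, while E is computed from the current (possibly wrong) rating.
    """
    mean_s = expected_score(true_skill, opponent)
    e = expected_score(rating, opponent)
    return k * (mean_s - e)


overrated = average_drift(rating=1800, true_skill=1500)   # negative
underrated = average_drift(rating=1200, true_skill=1500)  # positive
accurate = average_drift(rating=1500, true_skill=1500)    # zero

print(overrated, underrated, accurate)
```

The drift is zero only when the rating matches the true skill, which is why that point acts as the equilibrium.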
How Many Games to Stabilize?
This depends on K and the rating pool:
Rule of thumb: With K = 32, a player's rating typically stabilizes within 20–30 games against opponents of varied strength. FIDE, for example, gives new players a higher K (40) until they have completed 30 rated games.
For a small ping pong ladder (8–12 people), expect ratings to feel "right" after everyone has played roughly 15–20 matches each.
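To get a feel for how K affects settling time, a small Monte Carlo run can help. This is a sketch under stated assumptions rather than a calibrated model: opponents have fixed ratings drawn uniformly from 1300 to 1900, there are no draws, the player's true skill is 1700 while the starting rating is 1500, and "stabilized" means the first time the rating comes within 50 points of the true skill. All of those numbers, and the function names, are illustrative choices.

```python
import random


def expected_score(rating, opponent):
    """Standard Elo expected score on the 400-point logistic scale."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400))


def games_until_close(k, true_skill=1700, start=1500, band=50,
                      max_games=2000, rng=None):
    """Count games until the rating first comes within `band` of true_skill."""
    rng = rng or random.Random()
    rating = start
    for game in range(1, max_games + 1):
        opponent = rng.uniform(1300, 1900)  # varied opposition (assumed range)
        win_prob = expected_score(true_skill, opponent)  # results follow true skill
        score = 1.0 if rng.random() < win_prob else 0.0
        rating += k * (score - expected_score(rating, opponent))
        if abs(rating - true_skill) <= band:
            return game
    return max_games  # never got close within the game budget


def mean_games(k, trials=300, seed=0):
    """Average settling time over many simulated careers."""
    rng = random.Random(seed)
    return sum(games_until_close(k, rng=rng) for _ in range(trials)) / trials


for k in (8, 16, 32, 64):
    print(k, round(mean_games(k), 1))
```

Larger K reaches the neighborhood of the true skill in fewer games, at the cost of larger swings once it gets there.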
The Starting Rating Problem
Everyone has to start somewhere. Common approaches:
Fixed start (e.g., everyone begins at 1500). Simple, but early games produce wild swings as the system figures out who's actually good.
Provisional period with high K. New players use K = 40–64 for their first N games, then drop to K = 32. This lets them find their level quickly without permanently distorting others' ratings.
Placement matches. Play several games before assigning a rating. Chatbot Arena does this — a new model is matched broadly at first to quickly estimate its strength.
For your ping pong ladder: Start everyone at 1500 with K = 40 for the first 10 games, then switch to K = 32. Ratings will feel reasonable after 2–3 weeks of regular play.
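The ladder recommendation above could be sketched as follows. The helper names `k_factor` and `update` are hypothetical, and the thresholds are simply the ones suggested in the text (K = 40 for the first 10 games, K = 32 afterwards). Each player is updated with their own K.

```python
def expected_score(rating, opponent):
    """Standard Elo expected score on the 400-point logistic scale."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400))


def k_factor(games_played, provisional_games=10, provisional_k=40, settled_k=32):
    """Higher K during the provisional period, lower K afterwards."""
    return provisional_k if games_played < provisional_games else settled_k


def update(rating_a, rating_b, score_a, games_a, games_b):
    """Return both new ratings after one game; score_a is 1, 0.5, or 0."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k_factor(games_a) * (score_a - e_a)
    # B's score and expectation are the complements of A's.
    new_b = rating_b + k_factor(games_b) * ((1 - score_a) - (1 - e_a))
    return new_a, new_b


# A brand-new player beats an established player (50 games) at equal ratings.
print(update(1500, 1500, 1, games_a=0, games_b=50))  # (1520.0, 1484.0)
```

In this example the newcomer gains 20 points while the established player loses only 16. Mixing K values means a game is no longer zero-sum, so the pool's total points can drift slightly during provisional periods.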
When Elo Doesn't Stabilize Well
Watch out for these failure modes:
Too few games: ratings haven't had enough data to converge. Common in small pools.
Non-transitive matchups: A beats B, B beats C, C beats A (like rock-paper-scissors). Elo assumes a single linear skill dimension and can't represent this.
Skill changes over time: if players improve rapidly, a low K lags behind their real strength. This is why FIDE uses a higher K for juniors.
Rating inflation/deflation: total points are conserved only in a closed pool where everyone shares the same K. When players join and leave, the total drifts. A newcomer who starts at 1500, loses, and then quits leaves their lost points behind, inflating everyone else; a player who gains points and then leaves takes them out of the pool, deflating it.
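The non-transitive case is easy to demonstrate. The sketch below assumes an idealized, fully deterministic rock-paper-scissors cycle in which each player always beats one rival and always loses to the other, with K = 32 for everyone. The player names and the `play` helper are illustrative.

```python
def expected_score(rating, opponent):
    """Standard Elo expected score on the 400-point logistic scale."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400))


def play(ratings, winner, loser, k=32):
    """Apply one Elo update in place for a decisive game (same K for both)."""
    e_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - e_win)
    ratings[winner] += delta
    ratings[loser] -= delta


ratings = {"rock": 1500.0, "paper": 1500.0, "scissors": 1500.0}
for _ in range(200):
    play(ratings, "paper", "rock")       # paper always beats rock
    play(ratings, "scissors", "paper")   # scissors always beats paper
    play(ratings, "rock", "scissors")    # rock always beats scissors

print({name: round(r) for name, r in ratings.items()})
print(round(expected_score(ratings["paper"], ratings["rock"]), 2))
```

All three ratings stay close to 1500, so Elo predicts something near a coin flip for every pairing even though each result is completely predictable. A single number per player has no way to encode the cycle.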
A player's rating is much higher than their true skill. Over many games, what happens?
In a 10-person ping pong ladder, what's the main risk of starting everyone at 1500 with K = 10?
Recommended Reading
The Yi Zhu blog post on Chatbot Arena Elo covers how LMSYS handles convergence and bootstrapping for new models being added to an existing leaderboard.