You can label your four AI tiers Easy, Medium, Hard, and Elite. That doesn’t make the ladder real. Until you can show, numerically, that Hard beats Medium and Medium beats Easy across every board size you ship, the ladder is just a pile of marketing. This post is about the round-robin tournament we run to prove it, the two heatmaps it produces, and the things the heatmaps revealed the first time we actually looked at them.
The claim, and the cost of being wrong
The promise a difficulty selector makes is monotone progress. A player beats Easy, graduates to Medium, hits a wall on Hard, and eventually touches Elite. If any of those steps is too small, the tier is filler — people skip it. If any step is too large, the tier is a brick wall — people churn. A badly-graduated ladder is a retention problem wearing an AI-difficulty costume.
Playtesting catches the extremes and nothing else. Your own hands know the game too well, and a handful of games from friends gives you a dozen samples against a combinatorial matrix: four tiers, five board sizes, two first-player assignments. Forty cells, maybe thirty data points. You can convince yourself of almost anything with that much noise.
What “graduated” actually means
For the ladder to hold together, two claims have to be true at the same time:
- Win rate strictly increases as you move up the ladder. Hard beats Medium more than half the time, and by more than Medium beats Easy.
- Score differential grows, not just flips sign. A ladder where Hard wins by half a cell is a fragile ladder even if the sign is right — any small rule change upends it.
Both claims have to survive across board sizes and across first-player assignments. A difficulty ladder that only holds up when the AI goes first is not a ladder, it’s a coin flip.
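The two claims are easy to state as code. A minimal sketch of "graduated" as an assertion (the helper and the numbers are hypothetical, not from the tournament script):

```python
def ladder_is_graduated(win_rates, margins):
    """Check both ladder claims for one board size / first-player column.

    win_rates[i]: win rate of tier i+1 over tier i (e.g. Medium over Easy).
    margins[i]:   mean score differential of tier i+1 over tier i.
    """
    # Claim 1: every step wins more than half, and the steps don't shrink.
    wins_grow = (all(w > 0.5 for w in win_rates)
                 and all(b > a for a, b in zip(win_rates, win_rates[1:])))
    # Claim 2: score margins grow up the ladder, not just flip sign.
    margins_grow = all(b > a for a, b in zip(margins, margins[1:]))
    return wins_grow and margins_grow

# Easy->Medium, Medium->Hard, Hard->Elite steps:
print(ladder_is_graduated([0.62, 0.68, 0.74], [1.1, 2.3, 4.0]))  # True
print(ladder_is_graduated([0.62, 0.55, 0.74], [1.1, 0.4, 4.0]))  # False
```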
The script
The tournament lives at ai/scripts/tournament_heatmap.py. It builds the shipped roster from build_players() in ai/src/tournament/tournament.py, inserts a non-shipped Blended Hard probe between Medium and Hard (more on that in a minute), and then runs an all-pairs round robin for every board size in [4, 5, 6, 7, 8] with both first-player assignments. The inner loop is pleasantly dumb:
```python
for board_size in [4, 5, 6, 7, 8]:
    for fp in ["Row first", "Row second"]:
        for i in range(n):
            for j in range(n):
                for _ in range(n_games):
                    result = play_game(ais[i], ais[j], board_size)
                    diff = result.scores[0] - result.scores[1]
                    if diff > 0: wins += 1
                    elif diff == 0: draws += 1
```

Fifty games per matchup is the default. With six players, five board sizes, and two first-player columns, that’s 9,000 games per run — cheap enough to run on a laptop in a few minutes because the engine is pure Python and the AIs are mostly 14-feature linear scorers.
Why two charts, not one
Score difference alone lies. An AI that wins by +2.3 points on average can still be losing 40% of its games; the big wins drag the mean around. Win rate alone also lies. A 55% win rate with 40% draws is a completely different game from a 55% win rate with 5% draws — the first says “both AIs are playing it safe and occasionally one slips,” the second says “one AI is clearly stronger.” You need both. So the script produces two PNGs: a red-blue score-difference heatmap and a green-yellow-red win/draw-rate heatmap. Each cell in the win-rate chart shows two numbers stacked: the row’s win percentage over the column on top, and the draw percentage below, suffixed with d.
If you can only look at one number to judge your ladder, you’re going to ship a broken ladder. Score and win rate disagree constantly, and the disagreements are where the interesting bugs live.
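A toy illustration of how the two numbers come apart (made-up score differentials, not tournament data):

```python
# Hypothetical per-game score differentials for one matchup (row minus column).
# Two blowout wins drag the mean positive even though 40% of games are losses.
diffs = [+9, +8, +1, -2, -2]

mean_diff = sum(diffs) / len(diffs)                  # +2.8 on average
win_rate  = sum(d > 0 for d in diffs) / len(diffs)   # yet only 60% wins
draw_rate = sum(d == 0 for d in diffs) / len(diffs)  # and no draws to explain it

print(f"{mean_diff:+.1f} avg, {win_rate:.0%} wins, {draw_rate:.0%} draws")
```

The score chart would paint this cell confidently blue; the win-rate chart shows how close it really is.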
The RNG trap
The first time we ran the tournament, close matchups flipped between runs. Medium would beat Easy 54% one afternoon and 48% the next. The underlying bug was in the game state, not the tournament: the seed fed to every softmax-sampling AI was Math.random(), freshly generated whenever a new game started. Every tournament run was sampling a different random stream, so every tournament run was a different measurement. The one-line fix:
```diff
// src/engine/game/GameManager.ts
- aiSeed: Math.random(),
+ aiSeed: 42,
```

With the seed pinned, two runs of the tournament produce bit-identical heatmaps. That turns the chart from a sample into an assertion: if a change to an AI shifts a cell in the heatmap, the change is responsible — not variance. It also means we can compare a before-heatmap and an after-heatmap and trust the diff.
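The bug generalizes beyond this engine. A Python sketch of the same failure and fix, using a stand-in for a softmax-sampling AI (not the real engine code):

```python
import random

def play_noisy_game(seed):
    """Stand-in for a softmax-sampling AI: the seed fixes its entire stream."""
    rng = random.Random(seed)
    return [rng.choice("abc") for _ in range(5)]

# The bug: a fresh seed per game, so back-to-back runs measure
# different random streams and close matchups flip between runs.
noisy_a = play_noisy_game(random.random())
noisy_b = play_noisy_game(random.random())  # no guarantee this matches noisy_a

# The fix: pin the seed, and two runs become an assertion, not a sample.
assert play_noisy_game(42) == play_noisy_game(42)
```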
What the heatmap revealed about Medium
The old Medium heuristic, in both the TypeScript game engine and the Python tournament roster, was immediate_ai + immediate_opp at softmax temperature 0.5 — a soft, sampled argmax over “which cell would score me the most right now, minus what it would give the opponent if they took it next.” It was stochastic on purpose, to keep play from feeling robotic.
The heatmap said: it’s too stochastic, and it’s missing the one feature that actually matters off the immediate step. Medium was beating Easy only narrowly on 4×4, and on larger boards the margin barely grew — exactly the “Medium feels like Easy with extra steps” failure mode. The fix was two lines in src/engine/ai/engine.ts:
```diff
- scores[i] = feats[i][0] + feats[i][1]; // immediate AI + opp
+ scores[i] = feats[i][0] + feats[i][1] + 0.3 * feats[i][2]; // + openness
  ...
- return cells[softmaxSample(scores, 0.5, seed)];
+ return cells[softmaxSample(scores, 0, seed)];
```

Adding the openness feature at weight 0.3 gives the AI a reason to prefer cells that still have room to grow — not a trained coefficient, just a tuned constant that matches the PythonMediumAI in ai/src/ai/opponents.py. Dropping temperature to zero turns the soft sample into a deterministic argmax. After the change, the heatmap showed the gap opening up: new Medium beats Easy decisively, and still loses to Hard on every board size. The ladder was straighter, but not yet evenly spaced.
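For reference, a Python analogue of what we assume softmaxSample does (the TypeScript signature above is the real one; this sketch just shows why temperature 0 is a plain argmax):

```python
import math
import random

def softmax_sample(scores, temperature, seed):
    """Pick an index with probability proportional to exp(score / T).

    As T -> 0 the distribution sharpens onto the best score, so T == 0
    is treated as a deterministic argmax.
    """
    if temperature == 0:
        return max(range(len(scores)), key=lambda i: scores[i])
    rng = random.Random(seed)
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in scores]
    return rng.choices(range(len(scores)), weights=weights)[0]

scores = [1.0, 2.0, 1.5]
print(softmax_sample(scores, 0.5, seed=42))  # stochastic: depends on the seed
print(softmax_sample(scores, 0, seed=42))    # always index 1, the argmax
```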
Blended Hard: probing the gap
The remaining problem was the size of the Medium-to-Hard step. Hard is trained with PPO and argmax-greedy; Medium is a three-feature hand-tuned heuristic. Even with the retuned Medium, the gap was big enough that a player who had just started winning on Medium got flattened by Hard on their first try. We didn’t want to dumb Hard down — it’s the last inline heuristic before Elite, and it needs to feel like work.
Instead, we built an analytical probe: a convex combination of the Medium heuristic weights and the trained Hard weights. The whole class is twelve lines in ai/src/ai/opponents.py:
```python
medium_w[0] = 1.0  # immediate_ai
medium_w[1] = 1.0  # immediate_opp
medium_w[2] = 0.3  # openness
hard_w = np.asarray(hard_weights).ravel()
blended = (1.0 - blend) * medium_w + blend * hard_w
```

With blend = 0.25, Blended Hard is 75% Medium and 25% trained Hard. It slots cleanly between the two in the heatmap, so we can see the gap smoothly instead of as a cliff. Blended Hard is not a shipped tier — players never see it in the difficulty picker. It’s the kind of thing you only build because you’re measuring, and it earns its keep by showing you the shape of the gap between two things you do ship.
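Sanity-checking the endpoints of a convex combination is one line each. A dependency-free sketch (plain lists instead of NumPy; the hard_w values here are made up for illustration):

```python
def blended_weights(medium_w, hard_w, blend):
    """Convex combination: blend=0.0 is pure Medium, blend=1.0 is pure Hard."""
    return [(1.0 - blend) * m + blend * h for m, h in zip(medium_w, hard_w)]

medium_w = [1.0, 1.0, 0.3]  # immediate_ai, immediate_opp, openness
hard_w   = [0.9, 1.4, 0.8]  # hypothetical trained values, for illustration only

assert blended_weights(medium_w, hard_w, 0.0) == medium_w  # pure Medium
assert blended_weights(medium_w, hard_w, 1.0) == hard_w    # pure Hard
print(blended_weights(medium_w, hard_w, 0.25))  # 75% Medium, 25% Hard
```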
Reading the plots
Here are both heatmaps as the script produces them. Each figure is a 5 × 2 grid: five board sizes stacked vertically, two first-player assignments side by side. Every panel is a full 6 × 6 matrix of [Easy, Medium, Blended Hard, Hard, Elite (linear), Elite (CNN)] against itself.

*[Figure: red-blue score-difference heatmap]*

*[Figure: green-yellow-red win/draw-rate heatmap]*
A few things the heatmaps made obvious the moment we printed them:
- Board size is a knob on discriminating power. On 4×4 the whole ladder compresses — there are so few playable cells that even Easy stumbles into competent moves — and tiers blur into each other. On 8×8 the ladder is clean and monotone. If you want a test that separates tiers, test on the biggest board.
- The diagonal carries signal. Self-play with a fixed seed should produce draws for deterministic AIs. The fact that Hard vs. Hard and Elite vs. Elite are mostly draws is evidence that the seeding is doing its job. If the diagonal ever drifts, we have a reproducibility bug before we have a difficulty bug.
- First-player matters, and it matters unevenly. Some tiers have a meaningful first-move advantage; others barely change. A ladder that only holds up in one column is a ladder that plays differently depending on whether the human picks white or black. Showing both columns makes that visible instead of averaged away.
- Blended Hard lives where we wanted it. The Medium → Blended Hard → Hard gradient is smooth in both charts. Which means the gap we saw before wasn’t a training artifact — there really is a lot of room between hand-tuned heuristic and PPO-trained greedy — and it gave us a handle on how to shrink the gap if we ever wanted to ship an intermediate tier.
What we didn’t ship (yet)
The tournament infrastructure can do more than we currently ask of it. run_round_robin in ai/src/tournament/tournament.py already computes Elo ratings across the full roster — we just don’t render them, because a single Elo number per tier smooths over exactly the board-size and first-player structure the heatmap exists to show. Someday a line chart of Elo vs. board size per tier would make a nice companion. There’s also nothing stopping us from running the tournament in CI on every AI change and failing the build if a cell moves by more than some threshold; the pinned seed makes that a realistic test, not a flake generator.
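The CI gate would be a handful of lines. A hypothetical sketch (the cell keys and threshold are illustrative, not the script's real data model):

```python
def shifted_cells(before, after, threshold=0.05):
    """Cells whose win rate moved more than threshold between two runs.

    With the seed pinned, any entry returned here was caused by a code
    change, not sampling noise -- so a nonempty result can fail the build.
    """
    return [(cell, after[cell] - before[cell])
            for cell in before
            if abs(after[cell] - before[cell]) > threshold]

before = {("Hard", "Medium", 8): 0.70, ("Medium", "Easy", 8): 0.62}
after  = {("Hard", "Medium", 8): 0.55, ("Medium", "Easy", 8): 0.62}

regressions = shifted_cells(before, after)
print(regressions)  # flags only the Hard-vs-Medium cell
# In CI: exit nonzero when regressions is nonempty.
```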
The thing we most want and don’t yet have is per-move disagreement analysis between Hard and Elite. A heatmap tells you Elite wins; it doesn’t tell you where Elite plays differently. That’s the next rung of this measurement stack.
Takeaway
If your game ships a difficulty ladder, you owe your players proof that the rungs are spaced. Round-robin heatmaps are the smallest artifact that actually delivers that proof: cheap to run, reproducible once you pin the seed, and hard to misinterpret when you produce both a score chart and a win-rate chart side by side. Building one cost us a day. The things it revealed the first afternoon we ran it — the noisy Medium, the big Medium-to-Hard step, the reproducibility bug hiding in a Math.random() call — would have cost a lot more than that to find in the wild, if we ever found them at all.
The feature vocabulary used throughout this post — immediate_ai, openness, ai_connectivity, and the rest — is documented in Building the Cell Division AI. To re-run the tournament yourself, see ai/README.md §Evaluation and Analysis and invoke python scripts/tournament_heatmap.py --games 50.