Engineering · 08

Tournament Heatmaps: Is the Difficulty Ladder Graduated?

You can label your four AI tiers Easy, Medium, Hard, and Elite. That doesn’t make the ladder real. Until you can show, numerically, that Hard beats Medium and Medium beats Easy across every board size you ship, the ladder is just a pile of marketing. This post is about the round-robin tournament we run to prove it, the two heatmaps it produces, and the things the heatmaps revealed the first time we actually looked at them.

The claim, and the cost of being wrong

The promise a difficulty selector makes is monotone progress. A player beats Easy, graduates to Medium, hits a wall on Hard, and eventually touches Elite. If any of those steps is too small, the tier is filler — people skip it. If any step is too large, the tier is a brick wall — people churn. A badly-graduated ladder is a retention problem wearing an AI-difficulty costume.

Playtesting catches the extremes and nothing else. Your own hands know the game too well, and a handful of games from friends gives you a dozen samples against a combinatorial matrix: four tiers, five board sizes, two first-player assignments. Forty cells, maybe thirty data points. You can convince yourself of almost anything with that much noise.

What “graduated” actually means

For the ladder to hold together, two claims have to be true at the same time:

- Ordering: every tier wins a clear majority against every tier below it, not a coin-toss edge.
- Spacing: the margins between adjacent tiers are comparable, so no step is filler and no step is a wall.

Both claims have to survive across board sizes and across first-player assignments. A difficulty ladder that only holds up when the AI goes first is not a ladder, it’s a coin flip.

The script

The tournament lives at ai/scripts/tournament_heatmap.py. It builds the shipped roster from build_players() in ai/src/tournament/tournament.py, inserts a non-shipped Blended Hard probe between Medium and Hard (more on that in a minute), and then runs an all-pairs round robin for every board size in [4, 5, 6, 7, 8] with both first-player assignments. The inner loop is pleasantly dumb:

for board_size in [4, 5, 6, 7, 8]:
    for fp in ["Row first", "Row second"]:
        for i in range(n):
            for j in range(n):
                wins = draws = 0
                for _ in range(n_games):
                    result = play_game(ais[i], ais[j], board_size)
                    diff = result.scores[0] - result.scores[1]
                    if diff > 0:
                        wins += 1
                    elif diff == 0:
                        draws += 1
Fifty games per matchup is the default. With six players, five board sizes, and two first-player columns, that’s 9,000 games per run — cheap enough to run on a laptop in a few minutes because the engine is pure Python and the AIs are mostly 14-feature linear scorers.

Why two charts, not one

Score difference alone lies. An AI that wins by +2.3 points on average can still be losing 40% of its games; the big wins drag the mean around. Win rate alone also lies. A 55% win rate with 40% draws is a completely different game from a 55% win rate with 5% draws — the first says “both AIs are playing it safe and occasionally one slips,” the second says “one AI is clearly stronger.” You need both. So the script produces two PNGs: a red-blue score-difference heatmap and a green-yellow-red win/draw-rate heatmap. Each cell in the win-rate chart shows two numbers stacked: the row’s win percentage over the column on top, and the draw percentage below, suffixed with d.
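The disagreement is easy to reproduce with a toy record of one matchup. The numbers below are illustrative, not from our tournament:

```python
# Illustrative score differences (row AI minus column AI) over ten games.
# A few blowout wins drag the mean positive even though most games are lost.
diffs = [12, 10, 11, -1, -1, -2, -1, -2, -1, -1]

mean_diff = sum(diffs) / len(diffs)
win_rate = sum(d > 0 for d in diffs) / len(diffs)
draw_rate = sum(d == 0 for d in diffs) / len(diffs)

print(f"mean score diff: {mean_diff:+.1f}")  # positive on average...
print(f"win rate: {win_rate:.0%}")           # ...while losing 7 games out of 10
```

The score chart would color this cell red for the row AI; the win-rate chart would color it red for the column AI. Only both charts together tell you what is actually happening.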

If you can only look at one number to judge your ladder, you’re going to ship a broken ladder. Score and win rate disagree constantly, and the disagreements are where the interesting bugs live.

The RNG trap

The first time we ran the tournament, close matchups flipped between runs. Medium would beat Easy 54% one afternoon and 48% the next. The underlying bug was in the game state, not the tournament: the seed fed to every softmax-sampling AI was Math.random(), freshly generated whenever a new game started. Every tournament run was sampling a different random stream, so every tournament run was a different measurement. The one-line fix:

// src/engine/game/GameManager.ts
-    aiSeed: Math.random(),
+    aiSeed: 42,

With the seed pinned, two runs of the tournament produce bit-identical heatmaps. That turns the chart from a sample into an assertion: if a change to an AI shifts a cell in the heatmap, the change is responsible — not variance. It also means we can compare a before-heatmap and an after-heatmap and trust the diff.
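The failure mode, and the fix, are easy to simulate. A minimal sketch, with a hypothetical `noisy_matchup` standing in for a full tournament matchup of near-coin-flip games:

```python
import random

def noisy_matchup(seed, n_games=50, p_win=0.52):
    """Toy stochastic matchup: count wins over near-coin-flip games."""
    rng = random.Random(seed)
    return sum(rng.random() < p_win for _ in range(n_games)) / n_games

# Fresh random seeds: every "tournament" samples a different stream,
# so every run is a different measurement of the same matchup.
print(noisy_matchup(random.random()), noisy_matchup(random.random()))

# Pinned seed: two runs are identical. If the number moves after a
# code change, the change is responsible, not variance.
assert noisy_matchup(42) == noisy_matchup(42)
```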

What the heatmap revealed about Medium

The old Medium heuristic, in both the TypeScript game engine and the Python tournament roster, was immediate_ai + immediate_opp at softmax temperature 0.5 — a soft, sampled argmax over “which cell would score me the most right now, minus what it would give the opponent if they took it next.” It was stochastic on purpose, to keep play from feeling robotic.

The heatmap said: it’s too stochastic, and it’s missing the one feature that actually matters off the immediate step. Medium was beating Easy only narrowly on 4×4, and on larger boards the margin barely grew — exactly the “Medium feels like Easy with extra steps” failure mode. The fix was two lines in src/engine/ai/engine.ts:

-    scores[i] = feats[i][0] + feats[i][1];            // immediate AI + opp
+    scores[i] = feats[i][0] + feats[i][1] + 0.3 * feats[i][2]; // + openness
...
-  return cells[softmaxSample(scores, 0.5, seed)];
+  return cells[softmaxSample(scores, 0, seed)];

Adding the openness feature at weight 0.3 gives the AI a reason to prefer cells that still have room to grow — not a trained coefficient, just a tuned constant that matches the PythonMediumAI in ai/src/ai/opponents.py. Dropping temperature to zero turns the soft sample into a deterministic argmax. After the change, the heatmap showed the gap opening up: new Medium beats Easy decisively, and still loses to Hard on every board size. The ladder was straighter, but not yet evenly spaced.
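The temperature knob is doing most of the work in that second line. A minimal sketch of the idea behind the engine's softmaxSample (not its actual implementation), showing why temperature zero collapses to a deterministic argmax:

```python
import math
import random

def softmax_sample(scores, temperature, seed):
    """Sample an index with probability proportional to exp(score / T).

    At T == 0 this degenerates to a plain argmax: the sampling is gone."""
    if temperature == 0:
        return max(range(len(scores)), key=lambda i: scores[i])
    rng = random.Random(seed)
    m = max(scores)  # subtract the max before exponentiating, for stability
    weights = [math.exp((s - m) / temperature) for s in scores]
    return rng.choices(range(len(scores)), weights=weights)[0]

scores = [1.0, 3.0, 2.5]
print(softmax_sample(scores, 0, seed=42))    # always index 1, the argmax
print(softmax_sample(scores, 0.5, seed=42))  # usually 1, sometimes 2
```

At temperature 0.5 the second-best cell still gets picked a meaningful fraction of the time, which is exactly the "too stochastic" behavior the heatmap flagged in old Medium.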

Blended Hard: probing the gap

The remaining problem was the size of the Medium-to-Hard step. Hard is trained with PPO and argmax-greedy; Medium is a three-feature hand-tuned heuristic. Even with the retuned Medium, the gap was big enough that a player who had just started winning on Medium got flattened by Hard on their first try. We didn’t want to dumb Hard down — it’s the last inline heuristic before Elite, and it needs to feel like work.

Instead, we built an analytical probe: a convex combination of the Medium heuristic weights and the trained Hard weights. The whole class is twelve lines in ai/src/ai/opponents.py:

hard_w = np.asarray(hard_weights).ravel()
medium_w = np.zeros_like(hard_w)
medium_w[0] = 1.0   # immediate_ai
medium_w[1] = 1.0   # immediate_opp
medium_w[2] = 0.3   # openness
blended = (1.0 - blend) * medium_w + blend * hard_w

With blend = 0.25, Blended Hard is 75% Medium and 25% trained Hard. It slots cleanly between the two in the heatmap, so we can see the gap smoothly instead of as a cliff. Blended Hard is not a shipped tier — players never see it in the difficulty picker. It’s the kind of thing you only build because you’re measuring, and it earns its keep by showing you the shape of the gap between two things you do ship.
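Downstream, the probe is just another linear scorer: the blended vector dots against the same per-cell feature vectors as every other tier, and argmax picks the move. A sketch of that scoring step; `score_cells`, the random feature values, and the stand-in hard weights are all illustrative, not the shipped code:

```python
import numpy as np

def score_cells(weights: np.ndarray, feats: np.ndarray) -> np.ndarray:
    """Linear scorer: one score per candidate cell; argmax picks the move."""
    return feats @ weights

# Hypothetical 14-feature vectors for two candidate cells.
rng = np.random.default_rng(0)
feats = rng.random((2, 14))

medium_w = np.zeros(14)
medium_w[:3] = [1.0, 1.0, 0.3]  # immediate_ai, immediate_opp, openness
hard_w = rng.random(14)          # stand-in for the trained Hard weights

blend = 0.25
blended = (1 - blend) * medium_w + blend * hard_w
move = int(np.argmax(score_cells(blended, feats)))
```

Sweeping `blend` from 0 to 1 walks the probe along the line between the two tiers, which is what makes it useful for seeing the shape of the gap rather than just its endpoints.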

Reading the plots

Here are both heatmaps as the script produces them. Each figure is a 5 × 2 grid: five board sizes stacked vertically, two first-player assignments side by side. Every cell is a full 6 × 6 matrix of [Easy, Medium, Blended Hard, Hard, Elite (linear), Elite (CNN)] against itself.

[Figure: score-difference heatmap across five board sizes and two first-player assignments]
Score difference: row AI mean score minus column AI mean score. Red means the row AI scores higher, blue means the column AI scores higher. The bottom-right corner of every sub-grid is consistently red — Elite (CNN) scores higher than everyone. The top-left is light because Easy barely scores against itself.

[Figure: win-rate and draw-rate heatmap across five board sizes and two first-player assignments]
Win rate (top number) and draw rate (bottom number with d suffix). Green means the row AI wins more, red means it loses more. On the diagonal, identical-AI matchups with a fixed seed produce near-100% draws for the deterministic tiers (Hard, Elite, Blended Hard) and noisier outcomes for the stochastic ones (Easy) — a quick sanity check that the seeding is doing its job.

A few things the heatmaps made obvious the moment we printed them:

- Old Medium beat Easy only narrowly on 4×4, and the margin barely grew on bigger boards; the retuned Medium opened the gap on every board size.
- The Medium-to-Hard step dwarfed every other gap on the ladder, and Blended Hard made the cliff visible as a cliff rather than a mystery.
- Elite (CNN) is red against every other tier, on every board size, in both first-player columns.
- The deterministic tiers draw near-100% against themselves on the diagonal, confirming the pinned seed is doing its job.

What we didn’t ship (yet)

The tournament infrastructure can do more than we currently ask of it. run_round_robin in ai/src/tournament/tournament.py already computes Elo ratings across the full roster — we just don’t render them, because a single Elo number per tier smooths over exactly the board-size and first-player structure the heatmap exists to show. Someday a line chart of Elo vs. board size per tier would make a nice companion. There’s also nothing stopping us from running the tournament in CI on every AI change and failing the build if a cell moves by more than some threshold; the pinned seed makes that a realistic test, not a flake generator.
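Such a gate would be a few lines once the heatmaps are exported as arrays. A hypothetical sketch: `fail_if_moved`, the array layout, and the 5-point threshold are inventions, not existing code:

```python
import numpy as np

def fail_if_moved(before: np.ndarray, after: np.ndarray, threshold: float = 0.05):
    """Raise if any win-rate cell moved more than `threshold` between runs.

    With the seed pinned, any movement is caused by the change under test."""
    delta = np.abs(after - before)
    moved = [(int(r), int(c)) for r, c in zip(*np.where(delta > threshold))]
    if moved:
        raise AssertionError(f"cells moved beyond {threshold:.0%}: {moved}")

baseline = np.array([[0.50, 0.62], [0.38, 0.50]])
fail_if_moved(baseline, baseline.copy())  # identical runs pass the gate
```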

The thing we most want and don’t yet have is per-move disagreement analysis between Hard and Elite. A heatmap tells you Elite wins; it doesn’t tell you where Elite plays differently. That’s the next rung of this measurement stack.

Takeaway

If your game ships a difficulty ladder, you owe your players proof that the rungs are spaced. Round-robin heatmaps are the smallest artifact that actually delivers that proof: cheap to run, reproducible once you pin the seed, and hard to misinterpret when you produce both a score chart and a win-rate chart side by side. Building one cost us a day. The things it revealed the first afternoon we ran it — the noisy Medium, the big Medium-to-Hard step, the reproducibility bug hiding in a Math.random() call — would have cost a lot more than that to find in the wild, if we ever found them at all.

The feature vocabulary used throughout this post — immediate_ai, openness, ai_connectivity, and the rest — is documented in Building the Cell Division AI. To re-run the tournament yourself, see ai/README.md §Evaluation and Analysis and invoke python scripts/tournament_heatmap.py --games 50.