Engineering · 06

The AlphaZero Detour

For about three weeks this spring, Cell Division’s AI stack went all-in on AlphaZero — Monte Carlo Tree Search guided by a policy/value network, trained on self-play, no human games, no hand-crafted features in the value head. We built the whole thing: batched MCTS, parallel self-play, tournament gating, tanh-normalized value targets. Then we didn’t ship it. The Elite tier running on your phone today is a small CNN that does one forward pass per move. No tree search, no Python, no AlphaZero. This post is about the detour, what it taught us, and why none of that work was wasted even though the weights never shipped.

What AlphaZero actually is, in one paragraph

A neural net takes a board as input and produces two things: a policy — a probability distribution over every legal move — and a value — an estimate of who’s winning, scaled to [-1, 1]. MCTS uses the policy as a prior to guide a tree search, runs a few hundred rollouts that bottom out at the value head, and picks the move whose subtree got visited the most. Self-play: the current net plays itself, MCTS makes the moves, and every game gets replayed into training as (position, visit distribution, final outcome). No human games, no opening book, no feature engineering beyond “stack of planes that encodes the board.” Cell Division’s branching factor is tiny — sixty-four cells on the largest board, far fewer than chess or Go — and we already knew the game had enough tactical depth that a pure linear model was leaving moves on the table. It looked like a good fit.
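The replay bookkeeping above can be sketched in a few lines. This is a hypothetical helper, not the project’s actual code — `game_to_examples` and its arguments are illustrative names — but it shows the shape of the data: each position in a finished game becomes one training tuple, with the final outcome flipped to the perspective of the player to move at that position.

```python
def game_to_examples(positions, visit_dists, outcome_p0):
    """Hypothetical helper: turn one finished self-play game into
    (position, policy target, value target) training tuples.
    `outcome_p0` is the final result from player 0's perspective,
    already scaled into [-1, 1]."""
    examples = []
    for ply, (planes, pi) in enumerate(zip(positions, visit_dists)):
        # Players alternate; the value target must be seen from the
        # side to move at this ply, so flip the sign on odd plies.
        sign = 1.0 if ply % 2 == 0 else -1.0
        examples.append((planes, pi, sign * outcome_p0))
    return examples
```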

Making self-play actually move

Naive MCTS is embarrassingly slow on a GPU. Every rollout descends the tree until it hits a leaf, then calls the net on that one position, waits for the result, and backs it up. One forward pass per rollout, with hundreds of rollouts per move and dozens of moves per game. A GPU built to chew through 256-position batches in parallel sits at single-digit utilization while MCTS feeds it one position at a time.

The fix is virtual-loss batched MCTS. Instead of descending one path at a time, we descend K paths before ever touching the net, and we penalize partial paths with a temporary “virtual loss” so that subsequent descents diverge instead of piling onto the same leaf. Once we have K leaf states queued, we evaluate them in one forward pass, revert the virtual loss, and back up the real values. Done right, it turns MCTS from a serial workload into a batched one while barely perturbing the selection semantics. Pair that with torch.multiprocessing self-play workers — each worker runs its own MCTS on a shared copy of the net and feeds a central replay queue — and the GPU finally has something to do.
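A minimal sketch of the mechanism, under assumed names (`Node`, `select_leaf`, `backup`) rather than the production trainer. It assumes the convention used throughout this post: each node stores its value from its own to-move perspective, so the parent negates child Q during selection, and virtual loss is a temporary win for the child’s player — which is a loss for the parent doing the selecting.

```python
import math

class Node:
    def __init__(self, prior):
        self.prior, self.children = prior, {}
        self.visits, self.value_sum, self.vloss = 0, 0.0, 0

    def q(self):
        n = self.visits + self.vloss
        # Virtual loss pretends this path is winning for the child's
        # player — i.e. losing for the parent who is selecting.
        return (self.value_sum + self.vloss) / n if n else 0.0

def select_leaf(root, c_puct=1.5):
    """Descend via PUCT, marking the path with virtual loss so the
    next descent in the same batch diverges instead of piling on."""
    node, path = root, [root]
    while node.children:
        total = sum(ch.visits + ch.vloss for ch in node.children.values())
        node = max(
            node.children.values(),
            key=lambda ch: -ch.q()  # negate: child Q is child-perspective
            + c_puct * ch.prior * math.sqrt(total + 1)  # +1 avoids a zero bonus
            / (1 + ch.visits + ch.vloss),
        )
        path.append(node)
    for n in path:
        n.vloss += 1
    return node, path

def backup(path, value):
    """Revert virtual loss and back up the real leaf value, flipping
    sign each ply so every node stores its own player's view."""
    for n in reversed(path):
        n.vloss -= 1
        n.visits += 1
        n.value_sum += value
        value = -value
```

A driver would call `select_leaf` K times, stack the K leaf positions into one forward pass, then call `backup` once per path with each real value.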

Three value targets in one afternoon

The value head is the part of the net that tells MCTS whether a leaf is winning. It’s also the part that was hardest to get right. Cell Division is a margin game — the final score is a two-digit number on each side and the winner is whoever has more — so the “correct” value target isn’t obvious. Over the course of about six hours we went through three different value-target formulations, each one in response to the last one visibly failing.

Take one: raw score difference divided by ten. Simple, unbounded, proportional to the thing we actually care about. It looked fine on paper. In practice the value head couldn’t converge: margins near terminal are noisy enough that the target kept sliding out from under the net, and unbounded values meant MCTS kept chasing imaginary twenty-point leads down lines that quietly ended at a two-point loss. Policy loss went down cleanly; value loss plateaued.

Take two: collapse everything to {-1, 0, +1} — the classical AlphaZero target. Cleaner signal, easier for the net to fit, and it matches what the paper does. We added a tanh to the value head at the same time to keep outputs in range. The problem was that Cell Division cares about the margin. The difference between a 12–4 win and a 9–7 win encodes real positional information — you want the net to prefer lines that actually dominate territory, not lines that squeak out the win and call it a day. With win/loss/tie targets the net started picking quiet, hedging moves over sharp tactical ones, because every outcome compressed to the same three numbers. It was strictly worse against the Hard opponent than take one had been.

Take three: tanh-normalized margin. Divide the score difference by a conservative upper bound on the largest possible margin for the current board size, then squash through tanh so the target lives naturally in (-1, 1). Keep the ordering information that raw margin had, keep the bounded range that the tanh value head wants, lose the noise sensitivity that unbounded margin had near terminal. This is the formulation that stuck. The value head started converging, MCTS started agreeing with the value head about which lines were good, and the net started actually beating Hard.
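Side by side, the three takes are a few lines each. These are hedged sketches — the function names and the `max_margin` bound are illustrative, and the real bound depends on board size — but the shapes of the three targets are as described above.

```python
import math

def value_target_raw(margin, scale=10.0):
    # Take one: raw score difference / 10. Unbounded — the value
    # head never converged on this target.
    return margin / scale

def value_target_wlt(margin):
    # Take two: classical AlphaZero win/loss/tie. Bounded and easy
    # to fit, but it throws away the margin information this game
    # actually cares about.
    return float((margin > 0) - (margin < 0))

def value_target_tanh(margin, max_margin):
    # Take three (the one that stuck): normalize by a conservative
    # upper bound on the largest possible margin for this board
    # size, then squash into (-1, 1) with tanh. Keeps the ordering
    # of raw margin, keeps the bounded range the tanh head wants.
    return math.tanh(margin / max_margin)
```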

With AlphaZero, value targets aren’t a configuration knob. They’re part of the search dynamics. A target the net can’t fit cleanly corrupts every MCTS rollout that uses the value head as a leaf estimate, and you end up with self-play games where both sides are reinforcing each other’s confusion. Get the target wrong and no amount of training will save you.

The sign bug that hid in plain sight

The other thing take three fixed was a latent bug in MCTS selection that had been there since the first day. PUCT picks the child with the highest Q + U, where Q is the average value seen through that child and U is an exploration bonus. The subtle part is that Q is stored from the child’s to-move perspective — after the move, it’s the other player’s turn, and the child’s value naturally reflects how the opponent views the resulting position.

We were adding child Q directly at the parent. Which meant that at every selection, the parent was picking the move whose resulting position made the opponent happiest. MCTS was quietly running adversarial to itself. The fix is one character:

q = -child.q()

We flipped the virtual-loss sign to match — it had been penalizing the wrong direction for the same reason. The net had been converging before the fix, just slowly, through a self-play signal that was subtly poisoned. That’s the scariest kind of bug: it doesn’t crash, it doesn’t assert, it doesn’t even stall the loss curve. It just wastes your GPU for two days and makes you distrust every other hyperparameter while you try to figure out why the net won’t learn. After that day, the first thing we check in any tree search code is which frame every value is stored in.
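The check is cheap to encode. Here’s a sketch of a PUCT scorer with the negation in place, plus the kind of sign-convention assertion that now goes in first — `Child` and `puct_score` are illustrative names, not the project’s code.

```python
import math
from dataclasses import dataclass

@dataclass
class Child:
    value_sum: float
    visits: int
    prior: float

    def q(self):
        # Stored from the *child's* to-move perspective.
        return self.value_sum / self.visits if self.visits else 0.0

def puct_score(child, parent_visits, c_puct=1.5):
    # The parent must negate child Q before comparing moves.
    # Dropping this minus sign makes the parent pick whichever
    # move makes the opponent happiest — the bug described above.
    q = -child.q()
    u = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
    return q + u
```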

Centered padding

One more thing the final rewrite fixed. Cell Division supports five board sizes — 4, 5, 6, 7, 8 — and the net has to generalize across all of them. The easy answer is to pick one canvas size (we chose 8×8) and embed every smaller board somewhere inside it. Early versions embedded smaller boards into the top-left corner. A 4×4 game lived in [0:4, 0:4] of the 8×8 canvas, and the other forty-eight cells were always empty.

Convolutional nets learn spatial priors from their inputs. If the game is always in the top-left, the conv filters learn that the top-left is where the game is, and cross-size generalization gets worse than it should. Worse, rotations and horizontal flips — the symmetries we were using for data augmentation — stopped being valid, because a rotation of a top-left-padded 4×4 board lands in a completely different region of the canvas. The data augmentation was actively corrupting the training set for smaller boards.

The fix is to center: embed an N×N board at offset ⌊(8 − N) / 2⌋, so the 4×4 game lives in the middle, the 8×8 game fills the whole canvas, and every smaller board is surrounded by equal padding on all sides — equal up to one cell for the odd sizes. Rotations and flips become genuine symmetries again (applied to the N×N sub-board before embedding), the conv filters stop caring where the game is on the canvas, and cross-size transfer gets measurably better. This is also when 5×5 and 7×7 joined the training mix — the odd sizes had been too broken under top-left padding to train on at all.
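In code, centering plus the dihedral augmentation looks something like this — a sketch assuming (C, N, N) plane stacks and a fixed 8×8 canvas; `embed_centered` and `augment` are illustrative names. Applying the symmetries to the raw sub-board before embedding is what keeps them valid even for the odd sizes, where the canvas-level rotation would shift the board by a cell.

```python
import numpy as np

def embed_centered(board, canvas=8):
    """Embed an (C, N, N) plane stack into the center of a
    (C, canvas, canvas) canvas; the offset is floored for odd N."""
    c, n, _ = board.shape
    off = (canvas - n) // 2
    out = np.zeros((c, canvas, canvas), dtype=board.dtype)
    out[:, off:off + n, off:off + n] = board
    return out

def augment(board):
    """Yield the 8 dihedral symmetries of the raw N×N board,
    applied *before* embedding so every variant lands in the
    same centered slot on the canvas."""
    for k in range(4):
        rot = np.rot90(board, k, axes=(1, 2))
        yield rot
        yield np.flip(rot, axis=2)
```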

Why it didn’t ship

Here’s the honest part. AlphaZero at inference time isn’t one forward pass. It’s a policy/value evaluation per MCTS simulation, and even a modest sixty-four simulations per move means sixty-four net calls before the opponent gets to play. On a mid-range phone, that’s hundreds of milliseconds per move, in Python, loading a multi-megabyte ResNet checkpoint, with tree bookkeeping living in interpreted code. Cell Division’s move loop plays one move per second. A player would watch the AI think, notice the delay, and start tapping the screen.

We could have trimmed. Smaller ResNet, fewer MCTS simulations, a cheaper C++ tree. That’s a real project. It’s a six-to-eight-week project, for a mobile game with one engineer, and at the end of it the AI is still doing tree search on a phone. The cost curve just didn’t bend the right way.

The path that did work is upstream of the app entirely. A strong teacher — trained offline, on a real GPU, with all the AlphaZero infrastructure we’d just built — generates move distributions for millions of positions. A small student CNN learns to imitate those distributions directly, in a single forward pass, with no tree search at all. The student is the thing that ships. One 3×8×8 input, one ONNX forward pass out of src/engine/ai/engine.ts, no Python at runtime, no MCTS, no replay buffer. On device it’s a couple of milliseconds. The teacher’s job is to have opinions strong enough to be worth imitating — which is exactly what the AlphaZero detour left us with.
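The distillation objective itself is just cross-entropy between the teacher’s MCTS visit distribution and the student’s predicted policy. A minimal numpy sketch of that loss — not the production training loop, and `distill_loss` is an illustrative name:

```python
import numpy as np

def distill_loss(student_logits, teacher_dist):
    """Cross-entropy between the teacher's visit distribution and the
    student's policy, computed from raw logits via a stable log-softmax.
    Shapes: (batch, moves) for both arguments."""
    z = student_logits - student_logits.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(teacher_dist * log_p).sum(axis=-1).mean()
```

Minimizing this pushes the student’s one-pass policy toward the move distributions the teacher only reaches after sixty-four net calls of tree search.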

What survived

The weights didn’t ship. The engineering did. Nearly every piece of the detour shows up in the production training pipeline under a different name.

Batched MCTS with virtual loss is how the teacher-data generator stays feasible. Generating millions of teacher labels means running the teacher through millions of positions, and a serial MCTS would have turned that into a multi-week job. Reusing the batched descent from the self-play trainer kept it overnight.

Tanh-normalized margin targets are still the right target for any value regression in this game. The distilled student uses the same normalization when it learns from teacher outputs, and the fallback switching-linear model uses the same intuition when it picks among candidate moves. The other two formulations we tried are both documented dead ends now, which means nobody has to rediscover them.

Centered padding is how train_student.py embeds boards, which is the only reason one student checkpoint can play all five board sizes without a per-size head. The augmentations work, the conv filters generalize, cross-size transfer is free instead of hard-won.

Sign discipline is harder to point at, because it lives in the kind of habit you only build by getting burned once. But it’s there. Every value we pass around in the current training code has a comment naming whose perspective it’s stored in, and every new piece of search code gets a sign-convention test as its first unit test.

The lesson

Three weeks of deep work on something that gets deleted is only a waste if the lessons get deleted with it. In this case the opposite happened: the production path that ships today is simpler, faster, and runs entirely on device — and every reason it works is something the AlphaZero detour paid for in advance. Batched search, bounded value targets, centered canvases, careful sign handling. The student that’s playing you on the Elite tier is not an AlphaZero network. But it couldn’t exist without one.