Chapter 9 · Part III — The craft

Thinking ahead

Everything the network does at play time is one forward pass — pure instinct. This chapter adds deliberation: simulate a handful of possible futures before committing, pick with hindsight, and then — because simulation is too slow to ship — bake the planner's judgment back into the instinct itself.

After this chapter you can explain

Why chess-style minimax is the wrong search for a simultaneous-move game, and what a stage game is instead
How the trained policy makes search affordable (top-k pruning) and how a single matrix cell gets filled (clone, joint action, fresh seed, value head)
What expert iteration is: play with the slow search teacher, distill its choices into the fast student
The real mc5→mc8 numbers — including why round one gained +22.4 points and round three gained nothing

Instinct and deliberation

By Chapter 8 you have a trained policy that plays a full game of doubles Pokémon. But look at how it plays: every decision is one forward pass. State goes in, a probability over legal actions comes out, an action is picked, done. It never asks "what happens if I do this?" It has no imagination — only reflexes, distilled from millions of self-play turns. Call that instinct.

Deliberation is the other thing a player can do: before committing to a move, actually try the candidate moves in your head, see where each one leads, and pick with the benefit of hindsight. For a program with access to the real simulator, "in your head" can be literal — simulate the candidate futures inside Showdown itself, then choose.

In plain terms

The policy is a query planner running on heuristics: it looks at the query shape and picks a plan in microseconds, using rules baked in from past workloads. Search is the planner that actually tries a few candidate plans against real statistics before executing the query. Slower per decision, but it catches the cases where the heuristic is confidently wrong. And, as you'll see, the heuristic is what makes trying plans affordable — it tells you which candidates are even worth costing out.

Why the chess-engine playbook doesn't work here

The classic game-tree algorithm is minimax: I consider my moves, then for each one I consider the opponent's replies, alternating down the tree. Chess engines are minimax plus fifty years of refinements. It's the obvious template — and it is wrong for this game, for one structural reason: in doubles Pokémon, both players move at the same time. You lock in your choice, they lock in theirs, and the turn resolves jointly. Nobody replies to anybody.

If you model the turn sequentially anyway — "I move, then they respond" — you are handing the simulated opponent information no real opponent has: sight of your move before choosing theirs. And that assumption doesn't just cost a little accuracy everywhere; it is catastrophically wrong at exactly the positions planning exists for. Protect mind-games (will they Protect this turn or attack?), Fake Out timing, Trick Room races — these are all guessing games between two players committing blind. A sequential model concludes Protect is useless (the simulated foe, seeing it coming, simply attacks the other slot) when in reality Protect is strong precisely because they can't see it coming.

The right model: each turn is a stage game — a matrix. My candidate actions are the rows, the opponent's candidate actions are the columns, and every cell is one possible joint outcome: "I click Icy Wind, they click Protect — what does the resulting position look like?" Thanks to Open Team Sheets, both full rosters are public, so the only hidden things each turn are the opponent's simultaneous choice and the dice. No guessing about what they might have — just about what they'll do. That makes this a far simpler search problem than classic hidden-team Pokémon.

Making it affordable: let the instinct prune

A full matrix is unaffordable. Each side can have hundreds of legal joint actions (two slots, each choosing among moves, targets, and switches) — the matrix could be up to 400×400, and every cell costs a full simulated turn. Even a modest fraction of that, every decision, is out of the question.

The trick — and the reason all of Part II had to come first — is policy-guided pruning. Ask the trained policy for its probabilities over both sides' actions (it can evaluate either seat), and keep only the top-k actions per side, with k around 6–12. Now the matrix is k×k: 36 to 144 simulations per decision. The instinct is what makes deliberation cheap: options the policy already knows are bad are never simulated at all. Search doesn't replace the network — it rides on it, using it twice: as the candidate generator here, and (next section) as the judge of what the simulations produce.

Inside one cell

Filling a cell means answering: "if this joint action happens, how good is the resulting position for me?" Four steps, all in src/selfplay/search.py talking to src/selfplay/bridge.mjs:

Snapshot the live battle. Showdown's engine can serialize its entire state (battle.toJSON()), and the bridge stores that snapshot under a handle.
Clone and apply. Rebuild a private copy from the snapshot and feed it the joint action — my row's choice and their column's choice, together, exactly as a real turn resolves.
Roll the dice — deliberately. The clone gets a fresh random seed, so damage rolls, secondary effects, and speed ties come out one concrete way. One such concrete playing-out of a chancy turn is called a determinization: the dice, frozen mid-air, one specific way. Average the cell over m different seeds to smooth the luck out.
Score the leaf. Don't simulate to the end of the game — that would be another fifteen turns of cost. Instead, hand the successor position to the value head, the critic from Chapter 6, which reads a position and estimates "probability I win from here." Trained as PPO's internal yardstick, it now takes a second job: leaf evaluator for the planner.

The depth-1 stage game. The policy nominates each side's top-k candidate actions; every (row, column) cell is one simulated joint turn, scored by the value head and averaged over m random seeds. Solving the finished matrix yields a mixed strategy over my rows — the bars on the left — not a single "best move."

Solving the matrix: the answer is probabilities

Once every cell has a number, what's "the best move"? Here's the part that surprises everyone raised on chess engines: against a thinking opponent in a simultaneous game, the answer is often not a single move. It's a mixed strategy — a probability over your rows.

Rock-paper-scissors makes it obvious. There is no best throw; any fixed choice is exploitable, and the optimal play is to randomize. Doubles Pokémon has the same texture in miniature every turn: if you always Protect the Pokémon they're threatening, they learn to double into the other slot; if you never Protect, they hammer you freely. The unexploitable answer might be "Protect 60%, attack 40%" — deliberately random, because the randomness itself is what denies the opponent a profitable read. Since the game is zero-sum (their gain is exactly my loss), the k×k matrix has a well-defined equilibrium, and search.py finds it with fictitious play: two imaginary players take turns best-responding to each other's accumulated history of choices, over and over, until the frequencies settle. It's approximate, dependency-free (no LP solver to install), and plenty for a k×k matrix. And note what the output is: a calibrated probability distribution over moves. That is exactly the shape the recommender wants to show a human — this mixed strategy is the product's move suggestion, not a by-product of it.

War story

Before trusting any of this, milestone 1 was an equivalence test (src/selfplay/search_check.py): replay 28 successor decisions across 5 games and demand the cloned simulation match live play bit for bit. It failed — twice, for two different reasons. First, battle.toJSON() turned out to alias the live battle's internal arrays: the "snapshot" shared memory with the source, so simulating on the clone corrupted the real game mid-search — a stat boost got applied twice. Fix: freeze each snapshot as an isolated JSON string. Second, the clone's protocol stream carried Showdown's |split| secret/public line pairs, and feeding both into the state tracker let the public rounded-%-HP line overwrite the exact HP the secret line had just set. Two silent corruptions, either of which would have made every search subtly wrong forever. The lesson generalizes to any pipeline: never trust a copy path without an equivalence test. If you clone state — a snapshot, a replica, a backfill — prove the clone behaves identically before you build on it.

Did it work? Yes — and then no

Depth-1 search over the policy was tested head-to-head with the raw policy in src/evaluate/search_bench.py, paired so both pilots face the identical matchup and seed: search-p1 won 67% of its games where the raw greedy policy won 43% over 30 matchups (20/30 vs 13/30). A one-turn think, guided and judged by the network itself, is a clear, large improvement. In the language of the theory, depth-1 stage-game search is a policy-improvement operator: given a decent value head, it provably plays at least as well as the policy it wraps.

So naturally the next move was depth 2 — value each successor not with the value head but by recursively solving its stage game one ply down. More lookahead, more strength, right? No. Depth-2 was a dud: paired argmax went 8–8 over 50 matchups, sampled play was flat, and — the telling detail — depth-2's strategies diverge from depth-1's by a total-variation distance of 0.49, meaning it picks genuinely different moves about half the time… and wins no more. Different, not better.

Why? The value head already is a depth-∞ estimate, in compressed form — it was trained on whole-game outcomes, so its score for a position implicitly integrates everything that tends to happen afterwards. A second ply of explicit simulation mostly re-derives what the leaf evaluation already knew, at 10× the cost. (There's a sharper suspect too, flagged later in docs/roadmap.md: the leaves are scored by a single Linear(128,1) layer trained as a side-objective inside PPO, and its noise is plausibly larger than the typical gap between cells — which would cap depth-2 and everything else that leans on leaf values.) Either way, the operational lesson stands: deeper is not automatically better; measure before you scale.

Key point

Search here is not a tree, it's a matrix: policy nominates top-k actions per side, the simulator fills the k×k cells, the value head scores the leaves, and a zero-sum solve returns a mixed strategy. One ply of this beat the raw policy 67–43. A second ply bought nothing.

Expert iteration: cache the planner into the weights

There's a catch with shipping search everywhere: it takes seconds per decision — fine for a recommender pondering one turn, unusable inside the training loop that needs thousands of decisions per minute. Worse, every simulated turn is another chance to trip the upstream Showdown sim-hang (Chapter 2's old enemy); search multiplies exposure to it by 50–400×. So search can't collect training data online. But its improved play can be captured offline. That's expert iteration (ExIt), and it's a plain teacher-student loop:

Generate (src/selfplay/exit_gen.py): play 150 self-play games in which every single decision is made by depth-1 search. For each decision, record not just the move played but the full searched mixed strategy — the whole probability vector. One matrix solve conveniently yields both sides' equilibrium strategies (row player and column player), so every game labels both trajectories. 150 games ≈ 2,400 labelled decisions.
Distill (src/selfplay/exit_train.py): plain supervised learning — no rewards, no PPO. Warm-start from the current policy, then fit the policy head to reproduce the searched strategies (cross-entropy loss: "make your softmax look like the teacher's distribution") and fit the value head to the actual game outcomes (mean-squared error). This is distillation: the slow expert's judgment, baked into the fast student's weights.

In plain terms

You have an exhaustive query that computes the truly correct answer but takes seconds, and a serving path that must answer in microseconds. So you run the expensive computation offline in batch, and load its answers into the fast model everyone actually queries. ExIt is exactly that: search is the batch job, the network is the serving cache, and distillation is the load step. After it, the raw one-forward-pass policy plays like the planner — with no planner attached.

The ExIt cycle. The fast policy powers the slow teacher (nominating candidates and scoring leaves); the teacher plays 150 fully-searched games; distillation fits the policy head to the searched mixed strategies and the value head to outcomes. The improved student then makes a better teacher, and the loop repeats — mc5 → mc6 → mc7.

The mc lineage: it works, it compounds, it converges

Round one distilled search-over-mc5 into mc6 (10 epochs; policy cross-entropy fell 2.68 → 1.78, value MSE 0.67 → 0.08). The most visible change was the shape of the output: mc5's action distribution was nearly flat — roughly 2% on everything, spread over ~288 joint actions — while mc6 concentrated 22 / 18 / 16% on search's preferred moves. On the paired pilot-bench against the mc2 field, sampled win-rate jumped 45.6% → 68.0%: +22.4 points, about 16 standard errors. Not subtle.

And here is Chapter 8's gotcha striking live: greedy evaluation of the same two checkpoints showed +1.1 points — pure noise. Greedy scores only the argmax, and ExIt mostly moved probability mass, not the single top pick. mc5's flat distribution meant sampling it played near-randomly (45.6%, worse than its own greedy 50.1%) — it was secretly dependent on greedy selection. Had the eval tools not been fixed to sample by default, this entire +22-point improvement would have been invisible.

Round two — teacher is now search over mc6 — produced mc7: 74.5% vs mc6's 69.1%, a clean +5.4 points (~3.8σ) on an already-sharp distribution. No flat-baseline artifact this time; expert iteration genuinely compounds. mc7 was promoted to the default checkpoint. Round three produced mc8 — and nothing: 78.3% vs mc7's 78.9% in the same run, −0.6 points, a null. The tell was in the training loss: cross-entropy had plateaued at ~1.56 across rounds two and three. The student now reproduces the depth-1 teacher's strategy almost exactly, so another round with the same teacher has nothing left to teach. Converged. mc8 was not promoted.

Three rounds of expert iteration, measured the honest way. Paired sampled win-rate against the mc2 field: mc5 45.6 → mc6 68.0 → mc7 74.5, then mc8 flat (78.3 vs mc7's 78.9 measured in the same round-3 run). CE plateaued at ~1.56 from round two — once the student reproduces the teacher, same-teacher rounds buy nothing.

Stage-game search algorithm

Depth-1 planning for simultaneous moves: top-k×top-k matrix of simulated joint turns, value-head leaves, zero-sum solve via fictitious play → a mixed strategy.

src/selfplay/search.py

Forward model mechanic

Snapshot / clone / joint-apply / fresh-seed commands added to the bridge, proven bit-for-bit equivalent to live play before use.

src/selfplay/bridge.mjs · search_check.py

Expert iteration algorithm

Generate 150 fully-searched self-play games, then distill the searched strategies (CE) and outcomes (MSE) into the fast policy. Repeat until CE plateaus.

src/selfplay/exit_gen.py · exit_train.py

Depth-2 pitfall

Twice the lookahead, zero gain: 8–8 paired argmax, strategies 0.49 TV away from depth-1's yet no stronger. The value head already integrates the future.

docs/archive/search.md

The cliffhanger

Step back and admire the arc: a policy that thinks in a single forward pass, a planner that provably improves it, and a distillation loop that pours the planner back into the forward pass. Win-rate against the benchmark field climbed 45.6 → 68.0 → 74.5 → 78.3 across the lineage. mc7 was promoted, twice-blessed by the numbers, the strongest checkpoint the project had ever produced.

It looked like a triumph. It was also, silently, a lobotomy in progress — through both of those promotions, the rising win-rate was hiding the destruction of something the numbers weren't measuring at all. The next chapter is about what mc7 could no longer do, why nobody noticed, and what got built so it can never go unnoticed again.

Check yourself

Why is minimax — the chess-engine model — wrong for this game, and where does it fail worst?

Both players commit their moves simultaneously, but minimax models turns sequentially, which hands the simulated opponent sight of your move before they choose. That misvalues exactly the spots planning is for: Protect mind-games, Fake Out timing, Trick Room races — guessing games whose value comes from the opponent not knowing your choice. The right model is a stage game: a matrix of (my action × their action), solved for a mixed strategy.

Search needs up to 400×400 simulated turns per decision unpruned. What makes it affordable, and what's the dependency?

Policy-guided pruning: the trained policy's probabilities nominate only the top-k actions per side (k≈6–12), shrinking the matrix to 36–144 cells. The dependency is the policy itself — the instinct trained in Part II is what makes deliberation cheap, because actions it rates as bad are never simulated at all. (Chapter 10 shows the dark side of that dependency.)

ExIt round one gained +22.4 points and round three gained −0.6. What happened in between?

Convergence. Round one had an inflated margin because mc5's distribution was pathologically flat (sampling it played near-randomly); round two's +5.4 was a genuine compound gain. By round three the distillation cross-entropy had plateaued at ~1.56 — the student now reproduces the depth-1 teacher's strategy, so another round with the same teacher has nothing new to teach. mc8 was not promoted.