Chapter 10 · Part III — The craft

When winning lies

The strongest checkpoint the project had ever produced could no longer play half the game — and the win-rate that promoted it, twice, never flinched. This chapter is the project's hardest-won lesson: how a rising number hid a strategy lobotomy, the exact mechanism that caused it, and the regression gate built so it can never happen silently again.

After this chapter you can explain
  • How expert iteration destroyed the entire class of delayed-payoff moves — and why the win-rate rose anyway
  • The two-part mechanism: a one-turn teacher blind to delayed payoffs, plus the top-k ratchet that makes the damage one-way
  • What the regression gate asserts (engagement floors vs safety checks) and why the two kinds have opposite risk profiles
  • Why mc9 was promoted at 61% over mc7's 78% — and why the fix that finally beat mc7 was a wider teacher, not a deeper one

The discovery

Picture the moment. Chapter 9 ended in triumph: mc7, twice-distilled from the search teacher, winning 78.3% against the benchmark field, freshly promoted to the default checkpoint. Someone hands it the corpus's most dedicated Trick Room team — a team whose entire game plan is to fire Trick Room and let its slow bruisers move first for four turns. And mc7… never clicks Trick Room. Not rarely. Never. Its measured propensity for the move was 0.8% — on teams built around nothing else. Piloting the heaviest Trick Room team in the corpus (PH0KDQR5VE, labelled weight 7), it set the move zero times.

Now the part that should make you sit forward: this didn't slip past one promotion. It slipped past two — mc6 and mc7 — while the win-rate went up through both. Every number anyone looked at said the policy was getting better. It was simultaneously getting better at the number and losing an entire dimension of the game.

What was lost, precisely

When src/evaluate/policy_regression.py was pointed at the lineage, the damage turned out to be nothing like one forgotten move. Measured as the mean probability the policy puts on a move across all decisions where it's legal:

move (class)                  mc5 (pre-ExIt)   mc7 (ExIt ×2, then default)
trick_room (speed control)    12–13%           0.8% (dead)
protect (scout / stall)       14.5%            3.9% (dead, on nearly every team)
tailwind (speed control)      16%              7–8%
fake_out (tempo / flinch)     22%              7–10% (halved)

Look at what these four moves have in common: none of them deals meaningful damage this turn. Trick Room and Tailwind pay off over the following turns; Protect trades this turn for information and safety; Fake Out trades damage for tempo. Expert iteration hadn't forgotten a move — it was systematically converting the policy into a hit-hard bot, grinding down the entire class of delayed-payoff utility moves. Every pre-ExIt checkpoint passes these checks; the collapse is a clean step-change at exactly mc5→mc6, the moment training switched from PBRS-shaped reinforcement learning to search distillation.

[Figure: chart of mean P on move when legal (%), mc5 (pre-ExIt) vs mc7 (ExIt ×2): trick_room 12 vs 0.8, protect 14.5 vs 3.9, tailwind 16 vs 8, fake_out 22 vs 7; dashed floor at 4%.]
Not one move — a class. Mean per-decision probability on each move where legal, mc5 vs mc7 (from docs/regression.md). Every delayed-payoff utility move collapsed together; Trick Room and Protect fell below the gate floors (dashed) that would later be defined — 4% for the wincon moves, 7% for Protect, 5% for Fake Out.

The mechanism, part one: a teacher blind to delayed payoffs

Recall Chapter 9's teacher: depth-1 search simulates exactly one turn, then hands the resulting position to the value head. Now walk a Trick Room decision through that machine. You click Trick Room: this turn, your side deals no damage, and probably takes some. The cell in the stage game's matrix looks like a wasted turn. Yes, the value head at the leaf gives some credit for "Trick Room is up and my team is slow" — but that thin, one-position credit is competing against rows where a simulated attack visibly deleted half a foe's HP. The payoff is delayed, so the move is undervalued.

Protect fails in the mirror-image direction. Click Protect and the simulated turn looks wonderful — incoming damage blocked, nothing lost. What the one-turn horizon can't see is Protect's cost: the free turn you just handed the opponent to set up, reposition, or double into your partner. The cost is delayed, so the move is overvalued — and then, because the zero-sum solve treats a predictable Protect as exploitable tempo, its equilibrium share erodes anyway. The one-turn teacher mis-prices the whole delayed class, in both directions.

This was confirmed at the source, not inferred: run the teacher itself — search over mc5, the checkpoint that still plays Trick Room at 12–13% — and the searched strategy assigns Trick Room 0.0%, even though the value head demonstrably still credits a "TR-is-up and I'm slow" position. The teacher was corrupting a healthy student. And nothing in the distillation loop protects the class: src/selfplay/exit_train.py is plain supervised learning on the teacher's strategies — it never sees the PBRS shaping reward Φ that originally taught these moves in Chapter 7. The signal that created the behavior was simply absent from the pipeline that overwrote it.
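Both mis-pricings can be reproduced in a toy model. The sketch below scores three actions twice: once the way a depth-1 teacher would (the single simulated turn plus a leaf credit from the value head), and once over a five-turn horizon. Every number in it is invented for illustration and is not a measurement from the project; the function names are hypothetical.

```python
# Toy per-turn payoffs for our side (invented numbers, illustration only).
# Index 0 is the turn the move is clicked; later entries are the turns after it.
PAYOFFS = {
    "attack":     [0.5, 0.3, 0.3, 0.3, 0.3],    # damage now, steady afterwards
    "trick_room": [-0.2, 0.6, 0.6, 0.6, 0.6],   # a "wasted" turn, then four strong turns
    "protect":    [0.4, -0.5, 0.2, 0.2, 0.2],   # safe now, free turn conceded next
}

# Hypothetical credit the value head gives the position after one simulated turn.
LEAF_CREDIT = {"attack": 0.1, "trick_room": 0.3, "protect": 0.1}


def one_turn_score(action):
    """What a depth-1 teacher sees: the simulated turn plus the leaf credit."""
    return PAYOFFS[action][0] + LEAF_CREDIT[action]


def full_horizon_score(action):
    """What the move is worth once the delayed turns are counted."""
    return sum(PAYOFFS[action])


def rank(score):
    """Actions ordered from best to worst under the given scoring function."""
    return sorted(PAYOFFS, key=score, reverse=True)


if __name__ == "__main__":
    print("one-turn view:", rank(one_turn_score))
    print("full horizon: ", rank(full_horizon_score))
```

In this toy, the one-turn view ranks Trick Room last and puts Protect almost level with attacking, while the full horizon ranks Trick Room first and Protect a distant last. The thin leaf credit is present in both views; it is simply too small to offset what the simulated turn shows.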

The mechanism, part two: the ratchet

Undervaluing a move once would be survivable — the next round could recover it. What made the collapse one-way is a structural interlock you already know from Chapter 9: search only evaluates the actions the policy's own prior nominates into the top-k. Follow the spiral. Round one's teacher underweights Trick Room; distillation dutifully lowers the student's prior on it. Round two's teacher uses that student as its candidate generator — and now Trick Room's prior is below the top-k cut, so the move never even enters the matrix. It isn't evaluated and rejected; it's never evaluated at all. The searched strategy gives it exactly zero, distillation pushes the prior lower still, and around it goes. The same pruning that made search affordable makes the damage irreversible.

In plain terms

It's a cache that only refreshes entries people are already hitting. An entry cools off slightly, drops out of the hot set, stops being refreshed, goes stale, gets hit even less — and expires forever. No single refresh cycle decided to evict it; the eviction is an emergent property of "popularity decides what gets re-examined." The policy's prior is the hit counter, top-k is the hot set, and Trick Room aged out of the cache.

[Figure: the ratchet loop. Policy prior on Trick Room drops (teacher underweights it once) → falls below the top-k cut (no longer nominated as a row) → never simulated again (searched strategy: exactly 0%) → distilled prior lower still (next round's candidate generator) → back to the start. One-way: 12–13% → 0.8% across two rounds.]
The ratchet. Policy-guided pruning — the very trick that made search affordable — closes the loop: a move whose prior falls below the top-k cut is never simulated again, so each distillation round can only push it lower. Trick Room went from 12–13% to 0.8% with no round ever explicitly deciding against it.
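The one-way property can be seen in a few lines of simulation. This is a minimal sketch under loud assumptions: the top-k cut is modelled as a fixed probability threshold (a stand-in for "the k-th largest prior"), distillation is modelled as a half step from the prior toward the teacher's target, and all numbers are invented. None of the names below come from the project.

```python
def searched_share(prior, cut, share_if_evaluated):
    """Teacher's probability on the move: zero unless the prior nominates it."""
    return share_if_evaluated if prior >= cut else 0.0


def distill(prior, target, lr=0.5):
    """One distillation round, modelled as a step from the prior toward the target."""
    return (1.0 - lr) * prior + lr * target


def run(cut, shares, prior=0.125):
    """Apply one round per entry in `shares`; return the prior after each round."""
    history = [prior]
    for share in shares:
        prior = distill(prior, searched_share(prior, cut, share))
        history.append(prior)
    return history


# Round 1: a teacher that under-prices the move (gives it 0.03 when it looks).
# Rounds 2-4: a hypothetically repaired teacher that would give it 0.125,
# but only if the move is nominated in the first place.
SHARES = [0.03, 0.125, 0.125, 0.125]

pruned = run(cut=0.08, shares=SHARES)    # policy-guided pruning
unpruned = run(cut=0.0, shares=SHARES)   # control: every move is always evaluated

if __name__ == "__main__":
    print("pruned:  ", [round(p, 4) for p in pruned])
    print("unpruned:", [round(p, 4) for p in unpruned])
```

After the first round both runs sit just under the cut. The unpruned control recovers as soon as the teacher is repaired; the pruned run keeps halving, because the repaired teacher never gets to look at the move.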

Naming the law

This failure has a name, and you've met it professionally: Goodhart's law. When a measure becomes a target, it stops being a good measure. Win-rate against a particular benchmark field was the promotion criterion, so the optimization process found the cheapest path to that number: fast generic damage beats a handful of archetypes in aggregate, so the ability to actually play those archetypes was sold off to buy it. Nothing measured the sale.

In plain terms

You've seen this pipeline before: a team optimizes a single aggregate KPI — say, average query latency — with no invariant checks, and ships a "faster" release that quietly dropped correctness on a category of queries nobody's dashboard sliced by. The aggregate went up; a whole class of behavior died. The fix in data engineering is never "stare harder at the KPI." It's schema checks, data-quality assertions, invariants that fail the build. That is exactly what got built next.

Key point

A scalar win-rate cannot see strategy collapse. mc7 won 78.3% while it could not play Trick Room, could not use Protect properly, and had lost half of its speed control, and the number that promoted it rose through the entire regression. If a behavior matters, something must assert it directly.

A regression gate for behavior

The response is src/evaluate/policy_regression.py (spec: docs/regression.md): a regression gate that every candidate checkpoint must pass before promotion — exactly like the schema and data-quality tests guarding a pipeline deploy, but the thing under test is a policy's behavior. It runs sampled games (how the policy actually deploys — Chapter 8's lesson) and asserts three different kinds of things.

Engagement floors: "sometimes" — never "optimally"

For each utility move, the gate selects teams whose declared win condition is that move — drawn from the corpus's team labels, e.g. Trick Room teams labelled at weight ≥5, where the strategy is unarguably the plan — and asserts the policy's mean probability on the move, over all decisions where it's legal, clears a floor: trick_room ≥4%, tailwind ≥4%, protect ≥7% on seed-drawn games, fake_out ≥5% (a looser guard, because Fake Out is only legal on entry, so it yields few decisions per game).

Two design choices deserve your attention. First, the metric: mean per-decision probability, averaged over hundreds of decisions, is a low-variance statistic — unlike per-game "did it ever click it," which is far too noisy to gate on near a threshold. Second, the floors are deliberately low. Healthy mc5 sits at 12–16% on these moves; the floors are 4–7%. They don't enforce "plays it well" or even "plays it often" — they catch exactly one pathology: never. And be honest about what that means: an engagement floor encodes an opinion about how the game should be played, and it slightly cages the model on purpose. The project's stated position: a strategy the corpus says is a team's entire plan may not be trained to zero probability, full stop — even if some future optimizer thinks that's fine.
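As a rough illustration of the metric and the floors, the sketch below computes the mean per-decision probability on a move over the decisions where it is legal and compares it with the floor. The floor values are the ones quoted above; the data layout (one dict of move probabilities per decision) and the function names are assumptions, not the layout used in src/evaluate/policy_regression.py.

```python
from typing import Dict, List, Optional, Tuple

# Floors quoted in the text, as fractions.
FLOORS: Dict[str, float] = {
    "trick_room": 0.04,
    "tailwind": 0.04,
    "protect": 0.07,
    "fake_out": 0.05,
}

# Assumed layout: one dict per decision, mapping each LEGAL move to its probability.
Decision = Dict[str, float]


def mean_prob_when_legal(decisions: List[Decision], move: str) -> Optional[float]:
    """Mean policy probability on `move`, over decisions where it is legal."""
    probs = [d[move] for d in decisions if move in d]
    if not probs:
        return None  # the move was never legal, so there is nothing to assert
    return sum(probs) / len(probs)


def check_floor(decisions: List[Decision], move: str) -> Tuple[bool, float]:
    """Return (passed, measured mean) for one engagement floor."""
    mean = mean_prob_when_legal(decisions, move)
    if mean is None:
        raise ValueError(f"{move} was never legal in the sampled decisions")
    return mean >= FLOORS[move], mean


if __name__ == "__main__":
    sampled = [
        {"trick_room": 0.10, "attack": 0.90},
        {"trick_room": 0.02, "attack": 0.98},
        {"attack": 1.00},  # Trick Room not legal here, so this decision is skipped
        {"trick_room": 0.03, "attack": 0.97},
    ]
    print(check_floor(sampled, "trick_room"))
```

Averaging probabilities rather than counting clicks is what keeps the statistic stable: every legal decision contributes a real number, not a coin flip.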

Alongside these sits a mega-evolution guard: on seed-drawn games where mega is available, the policy must use it in ≥80% of them. This one isn't a repair, since ExIt actually improved mega usage (mc5: 95% of games, 49% turn-1 confidence; mc7: 100% and 88%). It's insurance against a future regression on a near-free-value mechanic.

Safety assertions: a different species entirely

The gate's second class checks plays that are wrong under any theory of the game — not opinions, correctness. The flagship: putting probability mass on a single-target damaging move that is type-immune against every foe on the field (Close Combat into a Ghost-only field — the move lands on no one). Note the opposite risk profile: a safety assertion can never cage a strong policy, because no improvement ever wants these plays — a stronger policy passes it for free. Engagement floors can, in principle, hold a policy back; safety assertions cannot. That's why safety checks can be added freely, while every new engagement floor is a deliberate editorial decision.

The immune-hit finding was humbling in its own right: it wasn't a regression at all but a longstanding blind spot — every checkpoint failed it, mc5 at 31% mass, mc9 at 42%, because the sparse reward never punished a do-nothing move in a rare situation. And the fix was beautiful, because it needed zero retraining: a fully-immune single-target attack is strictly dominated, so Policy._legal_mask in src/selfplay/trainer.py simply removes it from the legal set before the masked softmax — Chapter 5's masking machinery, called back into service. mc9's 42% dropped to 0% with unchanged weights, the masked-off mass renormalizing onto useful moves. Contrast the second safety check, ally-KO (Earthquake that KOs your own healthy partner): it stays report-only, because it isn't strictly dominated — sometimes the double-KO trade is genuinely worth it — so it can't be masked, only discouraged by training. Same species of check, different enforcement, because only one of them is wrong always.
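The masking fix can be sketched as follows. This is not the project's Policy._legal_mask; it is a minimal stand-alone illustration with a hypothetical Move type, a three-row excerpt of the type chart, and no handling of abilities or items. It only shows the idea: remove a fully-immune single-target attack before the softmax, and the freed probability mass renormalizes onto the remaining moves.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Set

# Defending type -> attacking types it is immune to (small excerpt of the chart).
IMMUNITIES: Dict[str, Set[str]] = {
    "Ghost": {"Normal", "Fighting"},
    "Flying": {"Ground"},
    "Ground": {"Electric"},
}


@dataclass
class Move:  # hypothetical container, for illustration only
    name: str
    type: str
    damaging: bool
    single_target: bool


def foe_is_immune(move_type: str, foe_types: List[str]) -> bool:
    """A foe is immune if any one of its types is immune to the move's type."""
    return any(move_type in IMMUNITIES.get(t, set()) for t in foe_types)


def legal_mask(moves: List[Move], foes: List[List[str]]) -> List[bool]:
    """False for single-target damaging moves that every foe on the field is immune to."""
    mask = []
    for mv in moves:
        dead = (mv.damaging and mv.single_target and bool(foes)
                and all(foe_is_immune(mv.type, f) for f in foes))
        mask.append(not dead)
    # Never mask away every option; fall back to the unmasked set.
    return mask if any(mask) else [True] * len(moves)


def masked_softmax(logits: List[float], mask: List[bool]) -> List[float]:
    """Softmax over the unmasked entries; masked entries get exactly zero."""
    top = max(l for l, ok in zip(logits, mask) if ok)
    exps = [math.exp(l - top) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    moves = [
        Move("close_combat", "Fighting", True, True),
        Move("rock_slide", "Rock", True, False),
        Move("protect", "Normal", False, False),
    ]
    ghost_field = [["Ghost"], ["Ghost", "Poison"]]
    mask = legal_mask(moves, ghost_field)
    print(mask, masked_softmax([2.0, 1.0, 0.5], mask))
```

The weights never change; only the set the softmax normalizes over does. That is why a check like this can go from failing to passing without retraining.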

War story

The gate's first version had a subtler flaw: it was flaky. The same checkpoint's trick_room decision count swung from 109 to 156 between two runs — the harness seeded numpy but not torch, and batch_act samples the opponent's moves through the torch RNG. Near a 4% floor, a pass or fail was partly luck. The fix (seed both; the gate is now byte-identical across runs) matters more than it looks: a flaky gate is worse than no gate. Once a blocking check fails spuriously, people learn to re-run it until it passes — and then it guards nothing. Any data engineer who has watched a flaky pipeline test get politely ignored knows exactly how this movie ends.
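A helper along these lines would cover the gap described above. It is a sketch, not the project's actual fix: the name seed_everything is hypothetical, and the numpy and torch imports are wrapped so the snippet still runs where those libraries are not installed.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed every RNG the harness might sample from, not just one of them."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # numpy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # torch not installed; nothing to seed


def fingerprint(n: int = 5):
    """A few draws from the stdlib RNG, used here to show the run is repeatable."""
    return [random.random() for _ in range(n)]


if __name__ == "__main__":
    seed_everything(123)
    first = fingerprint()
    seed_everything(123)
    second = fingerprint()
    print(first == second)
```

A cheap regression test for the gate itself is to run it twice with the same seed and require identical decision counts; that would have caught the 109 versus 156 swing immediately.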

Engagement floor metric

Mean policy probability on a team's declared wincon move where legal, over dedicated teams. Floors of 4–7%, set to catch "never," not to enforce "optimal." Cages the model slightly, on purpose.

src/evaluate/policy_regression.py

Safety assertion mechanic

Plays wrong under any theory (immune-hit). Can't cage improvement — a stronger policy passes for free. Immune-hit fixed by masking, no retrain; ally-KO stays report-only because the trade is sometimes correct.

src/selfplay/trainer.py · Policy._legal_mask

The ratchet pitfall

Distillation lowers a prior → the move falls below search's top-k cut → never simulated → distilled lower. The pruning that makes search cheap makes the collapse one-way.

src/selfplay/exit_gen.py

Reward-anchored ExIt algorithm

On decisions where a utility move is legal, pin the distillation target to the RL policy's own prior — the one that learned the class correctly via PBRS. Search still teaches everything else.

src/selfplay/exit_train.py

The fix arc: strategy first, then buy the strength back

Fixing the policy took three swings, and the numbers tell the story better than adjectives.

Swing one: reward-anchored ExIt → mc9. Since the teacher mis-prices the delayed class but the pre-ExIt RL policy priced it correctly (it learned from thousands of PBRS-shaped episodes), the fix anchors the distillation target: on any decision where a utility move is legal, the target is pinned to the RL policy's own prior instead of the searched strategy; search still teaches every other decision — the immediate tactics it gets right. With deliberately light distillation (3 epochs; more erodes the rare Trick Room decisions), the result was mc9: passes every assertion — TR 4.8%, protect 13.0%, tailwind 15.7%, fake_out 19.1%, mega 85% — and sets Trick Room on turn 2 of the heavy-TR team mc7 played dead. The cost: 61.1% against the benchmark field, versus mc7's 78.3%.
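The anchoring rule itself is small. The sketch below picks the distillation target for one decision; the set of utility moves is an assumption (the four moves the gate guards), and the function name and dict layout are hypothetical rather than taken from src/selfplay/exit_train.py.

```python
from typing import Dict, Iterable

# Assumption: the anchored class is the four utility moves the gate guards.
UTILITY_MOVES = frozenset({"trick_room", "protect", "tailwind", "fake_out"})


def distillation_target(
    legal_moves: Iterable[str],
    searched: Dict[str, float],
    rl_prior: Dict[str, float],
) -> Dict[str, float]:
    """Pin the target to the RL prior whenever any utility move is legal."""
    if UTILITY_MOVES & set(legal_moves):
        return dict(rl_prior)   # the teacher mis-prices this class, so keep the prior
    return dict(searched)       # immediate tactics: let search teach


if __name__ == "__main__":
    searched = {"trick_room": 0.0, "attack": 1.0}
    rl_prior = {"trick_room": 0.125, "attack": 0.875}
    print(distillation_target(["trick_room", "attack"], searched, rl_prior))
    print(distillation_target(["attack", "switch"], {"attack": 0.7, "switch": 0.3},
                              {"attack": 0.5, "switch": 0.5}))
```

Note that the rule is per decision, not per move: when a utility move is legal, the whole target distribution comes from the RL prior, so the searched strategy never gets a chance to push the move toward zero.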

And it was promoted anyway. Sit with that decision, because it's the thesis of this chapter: the project deliberately replaced its strongest-ever checkpoint with one seventeen points weaker on the number, on the argument that a VGC policy that never sets Trick Room is worse than a lower win-rate against a weak field. The number was demoted from objective to evidence. If you take one sentence away from this course, that trade is a good candidate.

Swing two: distillation tuning — a wall. Multi-round anchoring against a fixed mc5 reference, then a KL-leash (a penalty tethering the student's distribution to mc5's, allowing more aggressive epochs) produced gate-passing checkpoints with the best strategy retention yet (kl-leash: TR 10.3%) — but essentially tied on strength. The diagnosis, once found, is obvious in hindsight: a narrow teacher (k=6, m=1) searching over an already-distilled policy is no longer meaningfully smarter than its student. Distilling it harder just photocopies the photocopy.
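One way to picture a KL-leash is as an extra term in the distillation loss. The form below is an assumption: the text only says the penalty tethers the student to mc5, so the direction of the KL term, the weight beta, and all names and numbers here are illustrative choices.

```python
import math
from typing import List


def cross_entropy(target: List[float], student: List[float]) -> float:
    """Standard distillation loss against the teacher's target distribution."""
    return -sum(t * math.log(s) for t, s in zip(target, student) if t > 0)


def kl(p: List[float], q: List[float]) -> float:
    """KL(p || q); grows quickly when q puts little mass where p has some."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def leashed_loss(student, target, reference, beta):
    """Distillation loss plus a penalty for drifting away from the reference policy."""
    return cross_entropy(target, student) + beta * kl(reference, student)


if __name__ == "__main__":
    # Two-action toy: [trick_room, attack]. Invented numbers.
    reference = [0.125, 0.875]   # stand-in for the mc5 reference
    target = [0.0, 1.0]          # a teacher that gives Trick Room nothing
    collapsed = [0.01, 0.99]
    retained = [0.10, 0.90]
    for beta in (0.0, 1.0):
        print(beta,
              round(leashed_loss(collapsed, target, reference, beta), 4),
              round(leashed_loss(retained, target, reference, beta), 4))
```

With beta at zero the collapsed student has the lower loss, which is the original failure. With the leash on, the ordering flips: the student that keeps some mass on Trick Room is preferred, even though the teacher's target gives the move nothing.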

Swing three: a wider teacher — the lever that worked. Not deeper (Chapter 9 buried depth-2), wider: more candidate actions per side and more determinization seeds per cell, i.e. a bigger matrix filled more carefully. exit_gen --k 10 --m 3 over mc5 produced policy_wide: gate-passing, +3.6 points on pilot-bench, Elo 260 vs the incumbent's 253. Pushing further, --k 12 --m 5 produced policy_wider: the first gate-passing checkpoint to beat raw mc7 — Elo 309 vs 301, pilot-bench 84.5%. It is the deployed default. The strength ceiling had never been search depth; it was search budget.
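To get a feel for how much budget "wider" buys, the arithmetic below assumes each decision fills a k × k matrix (k candidate actions per side) with m simulated turns per cell. That cost model is an inference from the description above, not a figure from the project.

```python
def simulations_per_decision(k: int, m: int) -> int:
    """Assumed cost model: a k x k matrix with m determinization seeds per cell."""
    return k * k * m


CONFIGS = {
    "narrow (k=6, m=1)": (6, 1),
    "wide (k=10, m=3)": (10, 3),
    "wider (k=12, m=5)": (12, 5),
}

if __name__ == "__main__":
    base = simulations_per_decision(6, 1)
    for name, (k, m) in CONFIGS.items():
        sims = simulations_per_decision(k, m)
        print(f"{name}: {sims} simulations, {sims / base:.1f}x the narrow teacher")
```

Under this model the narrow teacher runs 36 simulations per decision, the wide one 300, and the wider one 720, which is twenty times the narrow budget while staying at depth 1.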

The frontier

But the wider teacher isn't free. It pulls harder on immediate tactics, which erodes exactly the class the gate protects: wide sat at TR 4.9%, wider at 4.1% — right on the 4% floor. Strength and strategy retention trade against each other along a frontier, and this lever has been pushed to its edge: wider still, and the next checkpoint fails the gate. The standing options for the next round of strength are to anchor against a stronger reference than mc5 (protecting Trick Room at a higher baseline), or to bring in new RL signal rather than squeezing distillation again.

[Figure: Elo (random anchored at 0; axis 200–320) against Trick Room propensity (mean P when legal, %; axis 0–12), with the 4% floor and the gate-fail region (TR < 4%) marked; the frontier shows retention trading against strength. Points: mc7 · 301 · fails gate; wider · 309 · deployed; wide · 244; mc9 · ~220; kl-leash · ~253.]
The strength/strategy frontier. Elo vs Trick Room propensity. mc7 is strong but lives in the gate-fail region (TR 0.8%); mc9 restored the strategy at a steep strength cost; kl-leash bought the best retention (10.3%) but no strength; the wider teacher finally beat mc7 at Elo 309 — sitting at TR 4.1%, right on the floor. The lever is spent: any wider and the next checkpoint fails the gate.

The honest coda: what the gate is not

It would be a tidy ending to say the gate now guarantees the bot "plays Trick Room well." It guarantees nothing of the sort, and the project went and proved that about its own tooling. src/evaluate/archetype_bench.py — a two-axis probe measuring the policy both against each archetype and piloting each archetype — delivered three deflating findings. The deployed policy's gap to mc7 was uniform, ~+12 points across every archetype — including random teams: general strength, not a Trick-Room-shaped hole. Piloting committed archetype teams, it wins ~48% regardless of wincon — a dedicated TR team wins no more than a random pile. And the sharpest one: forcing the policy to actually set Trick Room every game moved its TR-team win-rate not at all — 45% versus 46%. Setting Trick Room is not the same thing as winning with Trick Room; the deficit is whole-plan piloting, not clicking the setup move.

So hold the gate in its correct mental slot: it is a guard, not a strength meter. Passing it means "this checkpoint still engages the plan and doesn't blunder" — never "this checkpoint plays the plan well," and never "this checkpoint is stronger." Strength is the other axis, measured by Elo and pilot-bench as in Chapter 8, and docs/evaluate.md states the promotion rule that closes the loop this chapter opened: both axes, always. No number promotes alone anymore.
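One possible reading of "both axes, always" is sketched below. The exact rule lives in docs/evaluate.md and is not reproduced in this chapter, so treat the logic as an assumption: the gate is a hard requirement, and strength is compared only between checkpoints that both pass it. The Checkpoint type is hypothetical; the Elo figures are the ones quoted in this chapter, with mc9's being approximate.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:  # hypothetical container, for illustration only
    name: str
    passes_gate: bool
    elo: float


def should_promote(candidate: Checkpoint, incumbent: Checkpoint) -> bool:
    """Assumed rule: the gate is mandatory; strength decides among gate-passers."""
    if not candidate.passes_gate:
        return False                 # no number promotes alone
    if not incumbent.passes_gate:
        return True                  # a gate-passing policy replaces a collapsed one
    return candidate.elo > incumbent.elo


if __name__ == "__main__":
    mc7 = Checkpoint("mc7", passes_gate=False, elo=301)
    mc9 = Checkpoint("mc9", passes_gate=True, elo=220)     # approximate
    wider = Checkpoint("policy_wider", passes_gate=True, elo=309)
    print(should_promote(mc9, mc7), should_promote(mc7, mc9), should_promote(wider, mc9))
```

Under this reading the chapter's three promotions fall out directly: mc9 replaces mc7 despite the weaker number, mc7 can never come back, and policy_wider replaces mc9 on strength.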

Key point

The final settlement: a behavioral gate that catches "never," strength benchmarks that catch "weaker," and a standing acknowledgment that neither catches "plays it badly." mc7 → mc9 → wider is the whole lesson in three checkpoints — the metric lied, an invariant caught it, and strength was rebuilt inside the invariant.

Check yourself
Win-rate rose through the mc6 and mc7 promotions while Trick Room died. How is that arithmetically possible?

Win-rate was measured in aggregate against a benchmark field. Generic fast damage beats most of that field, so the checkpoint got genuinely better at the measured number while losing the ability to execute a class of strategies the aggregate barely samples and never isolates. That's Goodhart's law: the measure became the target and stopped measuring what mattered. Only a direct behavioral assertion — mean probability on the wincon move when legal — could see the loss.

Why can safety assertions be added to the gate freely, while every engagement floor is a deliberate, slightly risky decision?

A safety assertion (like immune-hit) flags a play that is wrong under any theory of the game, so no improving policy ever wants it — it can never hold a stronger model back, and the strictly-dominated case can even be fixed by masking with zero retraining. An engagement floor encodes an opinion about how the game should be played, and could in principle cage a genuinely better policy that plays differently. The project accepts that cage on purpose, but each new floor is an editorial act, not a free check.

mc9 passed the gate at 61.1% vs mc7's 78.3%, and was promoted. What eventually recovered the strength, and why did that lever work when distillation tuning didn't?

A wider depth-1 teacher: exit_gen with k=12 candidates per side and m=5 seeds per cell produced policy_wider — the first gate-passing checkpoint to beat raw mc7 (Elo 309 vs 301). Distillation tuning had hit a wall because a narrow teacher (k=6, m=1) over an already-distilled policy is no smarter than its student — there was nothing left to distill. The ceiling was search budget, not depth; a bigger, better-averaged matrix made the teacher genuinely stronger again. The cost: wider sits at TR 4.1%, right on the 4% floor — that lever is now spent.