Chapter 11 · Part IV — The payoff

The product

The goal was never a bot that wins on a simulator. It was advice a human can use: which four to bring, who leads, what to click first. This chapter walks through the three deliverables that came out the other end, the one problem the system still has (told honestly), and what it would cost to move all of this to a new metagame.

After this chapter you can explain
  • What the recommender actually outputs, and why its probabilities are sampled, not greedy
  • Why the draft is currently foe-blind, and how preview features are fixing it
  • What transfers for free to a new metagame — and why the vocab tables are a surrogate-key sin
  • Which parts of this system came from you, and which came from the literature

Full circle: the deliverable is advice

Chapter 1 defined success as a recommender, and that is what shipped. Run src/tools/recommend.py --you <code> --foe <code> with your six Pokémon and the opponent's six, and the policy reads out its own probability distribution over all 90 pick-4-plus-lead options — the same 90-way team-preview action it learned to make in training. It then commits to its top pick and reports two more things: the turn-1 move distribution for those leads, and the value head's estimate of the position — the network's honest "who's winning" number before a single move is clicked.
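The figure of 90 is just counting: 15 ways to choose four of six, times 6 ways to choose two leads from those four. A minimal sketch of that enumeration follows. It assumes slots are numbered 1 to 6 and that order within the leads and within the back line does not matter; the function name and tuple layout are illustrative, not taken from the repo.

```python
from itertools import combinations


def preview_actions(team_size=6, bring=4, leads=2):
    """Enumerate every (leads, back) split: choose 4 of 6, then 2 of those 4 to lead."""
    actions = []
    for picked in combinations(range(1, team_size + 1), bring):
        for lead in combinations(picked, leads):
            # Whatever was picked but does not lead sits in the back.
            back = tuple(slot for slot in picked if slot not in lead)
            actions.append((lead, back))
    return actions


ACTIONS = preview_actions()
print(len(ACTIONS))  # 90
print(ACTIONS[0])    # ((1, 2), (3, 4))
```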

Every probability you see is the policy's own masked softmax, with sampled semantics rather than greedy argmax. That is Chapter 8's hard-won lesson baked directly into the product: greedy readouts undersell stochastic setup lines. A Trick Room team's plan might carry 30% probability rather than being the single top action — greedy would hide it entirely, and once did hide a whole checkpoint's improvement. The recommender shows you the distribution, so the setup lines surface.
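To make the greedy-versus-sampled distinction concrete, here is a small sketch of a masked softmax with both readouts. The logits and the legality mask are made-up numbers, and the function is an illustration of the idea rather than the repo's code.

```python
import math
import random


def masked_softmax(logits, legal):
    """Softmax over legal actions only; illegal actions get probability 0."""
    top = max(l for l, ok in zip(logits, legal) if ok)  # subtract max for stability
    exps = [math.exp(l - top) if ok else 0.0 for l, ok in zip(logits, legal)]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.2, 0.9, 0.4, 5.0]        # made-up scores; action 3 is illegal
legal = [True, True, True, False]
probs = masked_softmax(logits, legal)

# Greedy readout: only the single top action is ever reported.
greedy = max(range(len(probs)), key=probs.__getitem__)

# Sampled readout: every legal action appears in proportion to its mass.
rng = random.Random(0)
samples = rng.choices(range(len(probs)), weights=probs, k=1000)

print(greedy)                        # 0
print([round(p, 2) for p in probs])
print(samples.count(1) / len(samples))
```

In this toy case action 1 holds roughly a third of the probability mass, yet a greedy readout would never mention it.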

One flag upgrades the whole readout: --search replaces the raw softmax with the depth-1 planner's solved mixed strategy — the probabilities after thinking one full turn ahead against every plausible reply.
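One way to picture a "solved mixed strategy" is as the equilibrium of a small zero-sum matrix game: rows are your candidate actions, columns are the foe's replies, and each cell is a leaf value. The sketch below uses fictitious play, a textbook approximation method chosen here for brevity. The chapter does not say which solver the planner uses, and the 2×2 payoff values are invented.

```python
def fictitious_play(payoff, iters=20000):
    """Approximate the row player's equilibrium mix of a zero-sum matrix game.

    payoff[i][j] is the row player's value when row plays i and column plays j.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    for _ in range(iters):
        # Each side best-responds to the other's empirical play so far.
        row_vals = [sum(payoff[i][j] * col_counts[j] for j in range(n_cols))
                    for i in range(n_rows)]
        col_vals = [sum(payoff[i][j] * row_counts[i] for i in range(n_rows))
                    for j in range(n_cols)]
        row_counts[max(range(n_rows), key=row_vals.__getitem__)] += 1  # row maximizes
        col_counts[min(range(n_cols), key=col_vals.__getitem__)] += 1  # column minimizes
    return [c / iters for c in row_counts]


# Invented leaf values: neither row is best against both replies,
# so the equilibrium mixes them (exactly 40% / 60% for this matrix).
toy_payoff = [[0.4, -0.2],
              [-0.1, 0.3]]
mix = fictitious_play(toy_payoff)
print([round(p, 2) for p in mix])
```

The point of the toy matrix is that the "right answer" is a distribution, not a single action, which is why the searched readout is still reported as probabilities.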

In plain terms

The recommender is EXPLAIN ANALYZE for a matchup. You don't just get the plan the optimizer chose — you get the candidate plans it considered, with its own cost estimates attached, plus a bottom-line estimate of how the whole query will go. And --search is like asking the optimizer to actually execute a sample of each plan before ranking them.

[Figure: the recommender's data flow. Inputs: YOUR 6 (--you RA2M50KGNF, a 6-mon team code) and FOE'S 6 (--foe PH0KDQR5VE, visible via team sheets). THE POLICY: pointer net + value head; masked softmax, sampled; with --search, the depth-1 mix. Outputs: PICK-4 + LEADS, a 90-way distribution (e.g. 1,3 | 2,5 → 24%; 1,2 | 3,4 → 17%; …); TURN-1 MOVES for the committed leads (e.g. Fake Out + Protect → 19%); VALUE ESTIMATE, a "who's winning" scalar (e.g. +0.31, your favor).]
The recommender's contract. Two 6-mon team codes go in; the policy's own probability distributions come out — its pick-4 + leads over all 90 options, the turn-1 move mix for the committed leads, and the value head's position estimate. Percentages shown are illustrative of the output format, not measured values. src/tools/recommend.py.

Watching strong play — and playing against it

The second deliverable answers "what does good play of this matchup look like?" src/tools/showdown.py --t1 <code> --t2 <code> runs a best-of-N self-play series between two teams and dumps every game to showdowns/NNNN.log (the raw battle trace) plus a readable .md recap. Think of it as commissioning two strong players to pilot your matchup while you take notes. It also takes --search for planned play.

The third deliverable is playing the thing yourself. src/tools/play_match.py and src/tools/play_drive.py put you in the driver's seat against the deployed checkpoint, and --search gives the bot depth-1 planning at decision time. That flag matters beyond strength: the raw policy's game logs occasionally show 1-ply blunders — both slots clicking Trick Room on the same turn, which self-cancels — and one turn of lookahead refuses those outright, because the planner sees the second Trick Room undoing the first.

The bonus product: the policy as a team-evaluation oracle

Something unplanned fell out of having a strong policy: you can search team space with it. src/tools/evolve_teams.py and src/tools/find_best.py use the policy as the pilot on both sides and ask, for each team in the corpus, "how does this team do against the field when both sides are played well?" The output lives in best.txt: the ten best corpus teams ranked by win-rate against the field, topped by vgcpastes:KPWTACK35G at 70.6% over 60 opponents.
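The ranking idea reduces to a round-robin table. The sketch below assumes a hypothetical play_game(team, foe) callable that returns True when the first team wins; the real tools pilot both sides with the policy, which is replaced here by a deterministic stub with invented team names.

```python
def rank_by_field_winrate(teams, play_game, games_per_pair=1):
    """Return (team, win_rate) pairs sorted best-first.

    play_game(a, b) is a hypothetical callable: True if team a beats team b.
    """
    table = []
    for team in teams:
        wins = games = 0
        for foe in teams:
            if foe == team:
                continue  # a team does not play itself
            for _ in range(games_per_pair):
                wins += bool(play_game(team, foe))
                games += 1
        table.append((team, wins / games))
    return sorted(table, key=lambda row: row[1], reverse=True)


# Stub pilot: the "stronger" team always wins. Names and strengths are invented.
strength = {"team_a": 3, "team_b": 2, "team_c": 1}
ranking = rank_by_field_winrate(list(strength), lambda a, b: strength[a] > strength[b])
print(ranking)  # [('team_a', 1.0), ('team_b', 0.5), ('team_c', 0.0)]
```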

Notice what happened there. The policy was trained to answer "what should I click?" — and became reusable as an evaluation function for a completely different question, "which team is strong?" A good learned evaluator is infrastructure: once you have it, whole new query patterns open up on top of it, the way a well-built dimensional model ends up answering questions nobody wrote requirements for.

The honest open problem: the draft is foe-blind

Now the part a sales deck would omit. The recommender's headline feature — the pick-4 — currently barely looks at the opponent.

The probe that proved it is src/evaluate/preview_stability.py: take 10 teams, show each of them 40 different foes, and record the argmax draft. Result: 10 out of 10 teams commit the identical pick-4 and identical leads against every single foe. The distribution isn't degenerate — mean top-probability 0.30, entropy 0.54, a broad head — but the foe never reorders it. The tie-break is constant.
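The probe's logic is simple enough to sketch: for each team, collect the argmax draft against every foe and check whether more than one distinct draft ever appears. The argmax_draft callable and both stub policies below are hypothetical stand-ins, not the code in preview_stability.py.

```python
def count_foe_blind(teams, foes, argmax_draft):
    """Count teams whose argmax draft is identical against every foe shown."""
    blind = 0
    for team in teams:
        drafts = {argmax_draft(team, foe) for foe in foes}
        if len(drafts) == 1:  # the foe never changed the pick
            blind += 1
    return blind


def blind_policy(team, foe):
    # Stub: ignores the foe entirely.
    return ((1, 2), (3, 4))


def aware_policy(team, foe):
    # Stub: switches leads depending on the foe.
    return ((1, 2), (3, 4)) if foe % 2 == 0 else ((1, 3), (2, 4))


teams = list(range(10))   # placeholder ids for 10 teams
foes = list(range(40))    # placeholder ids for 40 foes
print(count_foe_blind(teams, foes, blind_policy))  # 10
print(count_foe_blind(teams, foes, aware_policy))  # 0
```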

War story

The cost shows up in the human-play logs (logs/claude/). The policy led Kingambit into a fully visible Intimidate + Fake Out Scrafty — the textbook counter-lead, sitting right there on the open team sheet. Another game, it led Kingambit + Charizard into a pair that hit both for 4× damage. A human glances at the foe's six and adjusts. The policy, at draft time, effectively couldn't.

Here is the interesting part: the root cause is architectural, not a training shortfall. In-battle actions carry a rich feature row — KO fractions, type effectiveness, priority. Preview actions carried all-zero action features. At draft time the scorer could see the foe only through the 128-dimensional state embedding bottleneck, and the whole-game credit signal — one ±1 twenty decisions later — was far too weak to force foe-sensitivity through that straw. The network wasn't refusing to look at the foe. It had almost nothing to look with.

The fix in flight is preview features: with PREVIEW_FEATS=1, the bridge fills those zero rows with matchup summaries it already knew how to compute — per-lead type-effectiveness on offense and defense against the foe's six, the fraction of the foe's team each lead outspeeds, and whether either side carries Fake Out or Intimidate. The first controlled ablation is encouraging: at matched training depth, the featured run adapts its pick-4 on 5 of 10 teams versus 0 of 10 for an identical run without the features — so the foe-sensitivity is attributable to the features, not the recipe.
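Two of the matchup summaries named above are easy to illustrate: the outspeed fraction and the Fake Out / Intimidate flags. The sketch below is an assumption about their shape, not the bridge's actual layout. Speeds and kits are made-up inputs, and the type-effectiveness features are left out because they need a type chart.

```python
def outspeed_fraction(lead_speed, foe_speeds):
    """Fraction of the foe's team that this lead strictly outspeeds."""
    return sum(lead_speed > s for s in foe_speeds) / len(foe_speeds)


def preview_features(lead_speeds, foe_speeds, my_kit, foe_kit):
    """One outspeed fraction per lead, then Fake Out / Intimidate flags per side."""
    feats = [outspeed_fraction(s, foe_speeds) for s in lead_speeds]
    for kit in (my_kit, foe_kit):
        feats.append(1.0 if "Fake Out" in kit else 0.0)
        feats.append(1.0 if "Intimidate" in kit else 0.0)
    return feats


# Made-up numbers: a slow lead and a fast lead against six foe speeds.
row = preview_features(
    lead_speeds=[50, 100],
    foe_speeds=[60, 70, 80, 90, 110, 120],
    my_kit={"Sucker Punch"},
    foe_kit={"Fake Out", "Intimidate"},
)
print([round(x, 2) for x in row])  # [0.0, 0.67, 0.0, 0.0, 1.0, 1.0]
```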

For the curious

The ablation also taught a second lesson. The first version overlaid the new preview features on the same 30 dimensions as the battle features, and early learning lagged about 70 iterations behind baseline — preview gradients and battle gradients were fighting over the same scorer weights, two signals interfering in one set of parameters. The fix: widen action rows to 60 dims with disjoint blocks (battle 0–29 untouched, preview 30–59). Same information, separate columns, no contention. Any data engineer who has watched two writers contend for one hot partition will recognize the shape of both the bug and the fix.
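The disjoint-block layout can be sketched in a few lines. The block boundaries (0–29 and 30–59) come from the text; the helper name and the zero-padding behaviour are assumptions made for illustration.

```python
ROW_DIMS = 60
BATTLE = slice(0, 30)    # battle features, untouched by the change
PREVIEW = slice(30, 60)  # preview features, in their own columns


def make_row(block, values):
    """Place values in one block of a 60-dim action row; all other dims stay zero."""
    width = block.stop - block.start
    if len(values) > width:
        raise ValueError(f"block holds {width} dims, got {len(values)}")
    row = [0.0] * ROW_DIMS
    row[block.start:block.start + len(values)] = values
    return row


battle_row = make_row(BATTLE, [0.8, 1.0, 0.0])   # made-up battle features
preview_row = make_row(PREVIEW, [0.0, 0.67])     # made-up preview features

# The two kinds of action never write to the same columns.
print(any(battle_row[PREVIEW]), any(preview_row[BATTLE]))  # False False
```

Because each action type only ever populates its own block, the scorer weights that read preview columns receive no gradient from battle rows, and vice versa.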

This split created a compatibility problem the project solved the way you would: schema versioning. Checkpoints now carry metadata — {state_dict, preview_feats, afeat} — and the evaluation harness feeds each policy exactly the inputs it trained on, even when an old and a new policy face each other inside the same battle. One consumer reads the v1 view, one reads v2, same underlying stream. docs/evaluate.md calls these "feature eras."
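A rough sketch of how per-checkpoint metadata could drive the inputs each policy receives. The field names preview_feats and afeat come from the text, but their meanings here (a boolean switch and the action-row width) are assumptions, and only two eras are modelled: the original 30-dim rows and the 60-dim disjoint-block rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckpointMeta:
    preview_feats: bool  # assumed: whether the policy trained with preview features
    afeat: int           # assumed: width of the action-feature rows it expects


def view_for(meta, row):
    """Return the slice of a full 60-dim row that this checkpoint trained on."""
    view = list(row[:meta.afeat])
    if not meta.preview_feats:
        # Policies trained without preview features never saw dims 30+ filled.
        for i in range(30, len(view)):
            view[i] = 0.0
    return view


old = CheckpointMeta(preview_feats=False, afeat=30)
new = CheckpointMeta(preview_feats=True, afeat=60)

# A preview action in the 60-dim layout: battle block empty, preview block filled.
full_row = [0.0] * 30 + [0.5] * 30
print(len(view_for(old, full_row)), any(view_for(old, full_row)))  # 30 False
print(len(view_for(new, full_row)), any(view_for(new, full_row)))  # 60 True
```

Both policies read the same underlying row; the old one simply receives the all-zero preview view it was trained with.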

What a new metagame costs

Every February, the format rotates: new Pokémon, new items, sometimes new rules. What happens to all of this? docs/new-rules.md works it out, and it is the most data-engineering chapter in the repo.

What transfers for free: the entire oracle and feature layer. KO fractions, type effectiveness, effective speed — all computed at decision time from the game data and damage calculator, not learned. A brand-new Pokémon holding a brand-new item gets correct action features on day one, provided Showdown ships the format. As the doc puts it: the network never had to learn the game's rules, only preferences over them. The data pipeline is likewise self-healing — new team DBs drop into data/, the validation cache auto-invalidates (it's keyed by format string), pools rebuild. State featurization is format-agnostic too.

What breaks: the vocab tables. And the way they break deserves a moment, because you have seen this failure mode before.

In plain terms

featurize.py builds its species/item/ability/move ID tables by sorting the corpus alphabetically at import time and numbering the result. That is a surrogate key regenerated from a mutable natural ordering — the textbook sin. Insert one new row and every key after it alphabetically shifts, silently remapping which embedding row means which Pokémon. It's the same disaster as rebuilding a dimension table's surrogate keys on every load: all your fact-table references still point somewhere, just no longer at the right thing.

In practice the failure is loud before it can be silent: new entries also change the vocab size, so the embedding matrices change shape and old checkpoints fail to load outright. The consequence is the same either way. Today, a metagame change means a fresh network and the full training recipe from scratch: ~120 iterations of self-play RL (cheap: at about 18 seconds per iteration, roughly 36 minutes in total), then the expert-iteration rounds that produce a wider-class checkpoint, then the gate and Elo. That is a full retrain, not one round of fine-tuning.

The designed fix is exactly the migration you would write: persist the vocab as an append-only JSON artifact. Existing IDs frozen forever; new entries appended at the end; on checkpoint load with a grown vocab, copy the old embedding rows into the resized matrix and freshly initialize only the new tail. Then a meta shift becomes a warm start plus a short fine-tune — the returning 90% of the roster keeps everything the network learned about it.

[Figure: the vocab migration, two panels.]
TODAY (rebuilt alphabetically)
  before: 1 Amoonguss, 2 Basculegion, 3 Charizard, 4 Incineroar
  after +Baxcalibur: 1 Amoonguss, 2 Basculegion, 3 Baxcalibur, 4 Charizard, 5 Incineroar
  ids 3→4, 4→5: rows scrambled; matrix shape changes; old checkpoint fails to load
THE FIX (append-only JSON)
  after +Baxcalibur: 1 Amoonguss, 2 Basculegion, 3 Charizard, 4 Incineroar, 5 Baxcalibur (appended)
  embedding matrix: rows 1–4 copied, row 5 fresh init; old ids frozen forever; warm start + short fine-tune
The vocab migration. Today (left), featurize.py rebuilds IDs by sorting the corpus alphabetically: inserting Baxcalibur shifts Charizard and Incineroar down, scrambling which embedding row means which Pokémon — and the reshaped matrix fails to load, forcing retraining from scratch. The designed fix (right): an append-only JSON vocab freezes old IDs forever, new entries get the next ID, old embedding rows are copied into the resized matrix and only the new tail is freshly initialized.
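The two ID schemes in the figure can be compared directly. This is a minimal sketch under stated assumptions: IDs start at 1 with 0 reserved for padding, embeddings are plain lists of floats, and every function name is illustrative rather than taken from featurize.py.

```python
import json
import random


def sorted_vocab(names):
    """Today's scheme: IDs regenerated from alphabetical order (0 is reserved)."""
    return {name: i for i, name in enumerate(sorted(names), start=1)}


def append_only_vocab(old_vocab, names):
    """Designed fix: existing IDs are frozen, new names take the next free ID."""
    vocab = dict(old_vocab)
    for name in sorted(names):
        if name not in vocab:
            vocab[name] = len(vocab) + 1
    return vocab


def grow_embedding(old_rows, new_size, dim, rng):
    """Copy old rows unchanged; freshly initialize only the new tail."""
    rows = [list(r) for r in old_rows]
    while len(rows) < new_size:
        rows.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return rows


roster = ["Amoonguss", "Basculegion", "Charizard", "Incineroar"]
v1 = sorted_vocab(roster)
rebuilt = sorted_vocab(roster + ["Baxcalibur"])
frozen = append_only_vocab(v1, roster + ["Baxcalibur"])

print(v1["Charizard"], rebuilt["Charizard"], frozen["Charizard"])  # 3 4 3
print(frozen["Baxcalibur"])                                        # 5

# The persisted artifact is just the mapping, serialized as JSON.
artifact = json.dumps(frozen)

rng = random.Random(0)
old_emb = [[float(i)] * 4 for i in range(len(v1) + 1)]  # row 0 is the padding row
new_emb = grow_embedding(old_emb, len(frozen) + 1, dim=4, rng=rng)
print(new_emb[:5] == old_emb, len(new_emb))                        # True 6
```

With the alphabetical rebuild, Charizard's ID moves from 3 to 4, so embedding row 3 now belongs to a different species. With the append-only vocab, every existing row keeps its meaning and only row 5 starts untrained.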

One graceful-degradation note worth knowing: an old policy that meets an unknown species today does not crash. Unknowns map to ID 0 — the "no identity" padding row — so the policy plays on oracle features alone. It knows the new Pokémon's damage math perfectly and its identity not at all. Mediocre, not broken.

Key point

The learned weights are the perishable asset; the oracle layer is the durable one. Everything computed from rules transfers to a new metagame for free. Everything learned from the corpus is hostage to the ID tables — and freezing those IDs is what turns "retrain from zero" into "warm start."

What's next

The roadmap's open threads, briefly:
  • Seed the training league with committed-archetype opponents (teams forced to actually execute Trick Room or Tailwind) so self-play can't collapse into one shared style.
  • Improve leaf-value quality, since the critic is what caps everything search-shaped.
  • Run expert iteration on the preview decision itself: it is one 90-way choice per game, so it can be searched directly by rolling out fast games for the top picks.
  • Extend the gate with more floors (redirect, screens).
  • Take the cheapest wins of all, the play-time knobs that need no retraining: run the depth-1 search wider at inference than the k=8, m=2 the play tools default to (the wide-teacher result proved budget was the ceiling), and sample from the searched mixture rather than playing its argmax, because a deterministic bot is a bot a human learns to exploit.

What you actually built

One more thing, and it's about you.

"I built a reinforcement-learning system without being a data scientist" sounds like it needs an asterisk. Look at where the load-bearing pieces of this system actually came from.

You watched the policy Earthquake its own partner, asked why, and the answer became the allyHit feature. You noticed it never mega-evolved and the answer became the +mega action variant. You lost a game leading Kingambit into a visible Scrafty and the question "why didn't it see that coming?" became the entire preview-features program — including the probe that quantified the problem before anyone wrote a fix. And you asked "but does it still play Trick Room?" of a checkpoint whose win-rate said everything was fine — and that question became the regression gate, the measurement philosophy of Chapter 10, and the reason the deployed policy today is one that passes it.

The pattern is consistent: the ideas came from watching the system play and refusing to accept "the number went up" as an answer. The algorithms (PPO, GAE, pointer networks, expert iteration) came from the literature, the same place everyone gets them. Of those two lists, the second is the learnable one. It has textbooks, it has names, and as of this chapter, you've read a course on it. The first list, noticing that a winning policy has stopped playing its own win condition, doesn't come from a textbook. It came from you, and it's the part of the system nobody could have downloaded.

Check yourself
Why does the recommender report sampled probabilities instead of the single best (greedy) action?

Because the policy encodes stochastic setup lines — like Trick Room — as probability mass, not as the argmax. Greedy readouts hide them entirely (greedy argmax once hid a whole checkpoint's +22-point gain), so the product shows the full masked-softmax distribution, and --search shows the planner's solved mixed strategy.

The draft was foe-blind even though the network could "see" the foe's team. What was the actual bottleneck?

Preview actions carried all-zero action features, so the only path from the foe's roster to the draft score was the 128-dim state embedding bottleneck — and the whole-game credit signal was too weak to force foe-sensitivity through it. Filling those rows with matchup summaries (PREVIEW_FEATS=1) made 5 of 10 teams adapt their pick where 0 of 10 did before.

A new Pokémon is added to the format. What survives, and what forces retraining?

The oracle/feature layer survives untouched — KO fractions, type effectiveness, and speed are computed from the rules at decision time. What breaks is the vocab: IDs are rebuilt by sorting the corpus alphabetically, so one insertion shifts every later ID and reshapes the embedding matrices, and old checkpoints fail to load. The fix is an append-only persisted vocab with copied embedding rows — turning a from-scratch retrain into a warm start.