Chapter 5 · Part II — The learner

The brain

Chapter 4 turned a battle into numbers. This chapter is about the neural network that reads those numbers and picks a move — and about the one design constraint that dictated its entire shape: the menu of legal actions changes every single turn.

After this chapter you can explain
  • Why a normal fixed-output classifier can't play this game, and how a pointer network sidesteps the problem
  • What the masked softmax does — and why deleting a provably-bad action from the menu beat retraining
  • What the value head is, and why this tiny extra output ends up mattering in three later chapters
  • Why one "action" covers both Pokémon at once, and why this whole thing runs faster on a CPU than a GPU

The problem: there is no fixed keyboard

Most neural networks you've heard of are classifiers with a fixed set of outputs. An image model has one output per label: cat, dog, truck. The output layer is a fixed-width table — the same columns every time, forever. Training means learning which column to light up.

This game refuses to fit that shape. On one turn the action space — the set of legal choices — might be 3 forced options. Two turns later it might be 250 combinations of moves, targets, and switches. At team preview it's exactly 90 draft choices. Worse, the choices have no stable identity: "action #17" on turn 3 might be Fake Out into the left foe, and on turn 7 it might be switch to Rillaboom. A fixed output column for "action #17" would be learning nothing coherent.

In plain terms

A classifier is a table with a fixed schema: same columns, every row. What this game hands you is a query result whose row count and row contents change every call. You can't score that with a fixed-width output table. You need a scoring function you apply per row — SELECT action, score(context, action) FROM legal_actions — where score is the same learned function no matter how many rows come back.

The answer: a pointer network

The network in src/selfplay/trainer.py (class Policy) is a pointer network: instead of owning a fixed keyboard of outputs, it points at items on a menu. It works in three steps, every decision.

Step 1 — Encode the situation into one vector

Everything the net knows about the current position gets concatenated into one long input: the 114-dim state vector from Chapter 4, plus the identity embeddings — 16 species embeddings (the 4 active Pokémon plus both full 6-mon rosters, which Open Team Sheets reveals at preview), and the 4 active mons' item and ability embeddings. That whole bundle (434 numbers) runs through two 128-wide layers and comes out as a single 128-dim vector called h.

Think of h as the position, digested: one compact summary of "who's on the field, who's healthy, who's fast, what's the weather, what's on both benches." It's computed once per decision and reused for everything that follows.
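
To make the shapes concrete, here is a minimal sketch of Step 1 in plain Python. The 114-dim state, the 16 + 4 + 4 embedding counts, the 434 total, and the two 128-wide Tanh layers come from this chapter. The per-embedding widths (16 for species, 8 for items and abilities) are an assumption, chosen only because they add up to 434, and the random weights stand in for trained ones. The real Policy class in trainer.py may be organized quite differently.

```python
import math
import random

STATE_DIM = 114                           # state vector from Chapter 4
N_SPECIES, N_ITEMS, N_ABILITIES = 16, 4, 4
SPECIES_W, ITEM_W, ABILITY_W = 16, 8, 8   # ASSUMED widths; they merely sum to 434
HIDDEN = 128

INPUT_DIM = (STATE_DIM + N_SPECIES * SPECIES_W
             + N_ITEMS * ITEM_W + N_ABILITIES * ABILITY_W)


def make_linear(n_in, n_out, rng):
    """Random (weights, bias) pair standing in for a trained layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear_tanh(layer, x):
    """One dense layer followed by tanh."""
    weights, bias = layer
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def encode(layers, encoder_input):
    """Digest the concatenated input into the summary vector h."""
    h = encoder_input
    for layer in layers:
        h = linear_tanh(layer, h)
    return h


rng = random.Random(0)
encoder = [make_linear(INPUT_DIM, HIDDEN, rng), make_linear(HIDDEN, HIDDEN, rng)]
fake_input = [rng.uniform(-1, 1) for _ in range(INPUT_DIM)]  # placeholder position
h = encode(encoder, fake_input)
print(INPUT_DIM, len(h))  # 434 128
```

Whatever the real embedding widths are, the point is the same: everything is concatenated once, digested once, and the resulting h is reused for every candidate action.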

Step 2 — Score each candidate action independently

Now the menu arrives: the bridge hands over every legal action, each carrying its own feature row — a 30-dim block describing the two slots' choices (is it a move or a switch, base power, type effectiveness, priority, PP, the calc oracle's "does this KO?" bits, ally-hit risk, flinch chance, mega flag, charge-turn flag), plus that action's own embeddings: up to 4 species ids (switch targets, or the 4 drafted mons at preview) and 2 move ids.

For each candidate, the net concatenates h with that action's features and embeddings and pushes the bundle through a small scorer (a 64-wide layer, then a single output). Out comes one number per action: a raw score. Same scorer, same weights, applied to 3 rows or 250 rows — it doesn't care how long the menu is.
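
A rough sketch of Step 2 in plain Python, showing that one set of scorer weights handles any menu length. The 128-dim h, the 30 action features, and the 64-wide hidden layer are from the text. The combined embedding width (EMB_DIM), the tanh activation inside the scorer, the omitted biases, and the random weights are all assumptions made for illustration.

```python
import math
import random

H_DIM, FEAT_DIM, SCORER_HIDDEN = 128, 30, 64
EMB_DIM = 32  # ASSUMED total width of one action's species + move embeddings
ROW_DIM = H_DIM + FEAT_DIM + EMB_DIM


def make_scorer(rng):
    """Random weights standing in for the trained scorer (64-wide layer, then 1 output)."""
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(ROW_DIM)]
          for _ in range(SCORER_HIDDEN)]
    w2 = [rng.uniform(-0.1, 0.1) for _ in range(SCORER_HIDDEN)]
    return w1, w2


def score_row(scorer, h, action_row):
    """Score one candidate: concatenate h with the action's own inputs, then a small MLP."""
    w1, w2 = scorer
    x = h + action_row  # list concatenation, not addition
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in w1]
    return sum(w * v for w, v in zip(w2, hidden))


def score_menu(scorer, h, menu):
    """Apply the same scorer to every row; the menu can be any length."""
    return [score_row(scorer, h, row) for row in menu]


def random_menu(n_rows, rng):
    """Placeholder action rows (features + embeddings) for the demo."""
    return [[rng.uniform(-1, 1) for _ in range(FEAT_DIM + EMB_DIM)]
            for _ in range(n_rows)]


rng = random.Random(0)
scorer = make_scorer(rng)
h = [rng.uniform(-1, 1) for _ in range(H_DIM)]

short_scores = score_menu(scorer, h, random_menu(3, rng))
long_scores = score_menu(scorer, h, random_menu(250, rng))
print(len(short_scores), len(long_scores))  # 3 250
```

Nothing in score_row knows how many siblings a row has, which is exactly the property a fixed-width output layer lacks.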

Step 3 — Softmax over the legal menu only

The raw scores go through a masked softmax: exponentiate each score, divide by the sum, and you get probabilities that sum to 1 over exactly the current menu. A score of 1.8 next to a 0.9 and a −0.4 becomes 66% / 27% / 7%. That probability distribution is the policy's answer: during training it samples from it; at play time it can take the argmax.
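
The arithmetic in that example can be checked with a few lines of standard-library Python. This is only a sketch of the softmax step plus the two ways of using its output (sampling during training, argmax at play time); the function and variable names are illustrative, not taken from trainer.py.

```python
import math
import random


def softmax(scores):
    """Exponentiate and normalize; subtracting the max keeps exp() from overflowing."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [1.8, 0.9, -0.4]          # the three raw scores from the text
probs = softmax(scores)
print([round(p * 100) for p in probs])  # [66, 27, 7]

# Training-style choice: sample an index in proportion to its probability.
rng = random.Random(0)
sampled = rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Play-time choice: take the highest-probability index.
greedy = max(range(len(probs)), key=probs.__getitem__)
print(greedy)  # 0
```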

[Figure: the forward pass. Inputs: state vector (114 floats, ch. 4); 16 species embeddings (4 active + both rosters); item and ability embeddings (4 items, 4 abilities). Encoder: 434 → 128 → 128, two Tanh layers, producing h (128-d). From h: the value head, Linear(128→1), "am I winning?"; and one shared scorer applied per row, (h + 30 feats + embs) → score, followed by a masked softmax → probabilities. Example menu rows: "Fake Out→foe-a + Protect", "Icy Wind→foes + U-turn→foe-b", and a team-preview row "team 1356 (preview: pick-4+leads)"; illustrative scores (1.8, 0.9) and probabilities (66%, 27%, 7%).]

Figure notes: up to 400 rows on a busy turn, exactly 90 at team preview; the shape of the net never changes. The value head reads only h, with no action features. It predicts the game result from this position alone, and becomes the baseline for GAE (ch. 7) and the leaf evaluator for search (ch. 9).

The whole forward pass. Inputs are digested into one 128-dim summary h; a single shared scorer is applied per candidate action (a team-preview pick routes through the same scorer on its own turn); a masked softmax turns scores into a probability menu. Scores and percentages here are illustrative; the architecture and dimensions are the real ones from trainer.py.

Key point

The network never owns a fixed set of outputs. It owns two learned functions — "digest the position" and "score one candidate given the digest" — and applies the second one per menu item. That's why the same net can handle 3 forced options, 250 joint battle actions, and the 90-way draft at preview without changing shape.

For the curious

The action feature row is actually 60-wide today: dims 0–29 are the battle features described above, and dims 30–59 are team-preview matchup features, kept deliberately disjoint. An earlier version shared the dims — and the preview gradients (noisy, 90-way, fired once per game) directly interfered with the battle scorer's most load-bearing weights (the KO features), costing about 70 iterations of early learning progress. Separate columns for separate concerns, like not letting a batch backfill job write to the columns your live queries depend on. See the comment above AFEAT in src/selfplay/trainer.py.

The mask: illegal actions never exist

Before the softmax runs, a boolean mask marks which rows are real. Padded rows (the menu is padded to the batch's longest menu) and anything not offered by the simulator get their score overwritten with −10⁹ — effectively negative infinity — so the softmax assigns them probability zero. The net never wastes a single gradient learning "don't click what you can't click." Legality is enforced by construction, not learned by punishment.
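
A minimal sketch of that mechanism in plain Python: menus of different lengths are padded to the longest one, a boolean mask records which rows are real, and masked rows are overwritten with −10⁹ before the softmax. The helper names (pad_batch, masked_softmax) are hypothetical; the real implementation in trainer.py presumably works on tensors.

```python
import math

NEG_FILL = -1e9  # stands in for negative infinity, as described above


def masked_softmax(scores, mask):
    """Softmax over rows where mask is True; masked rows end up with probability 0."""
    filled = [s if keep else NEG_FILL for s, keep in zip(scores, mask)]
    top = max(filled)
    exps = [math.exp(s - top) for s in filled]  # exp of a huge negative underflows to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def pad_batch(menus, pad_score=0.0):
    """Pad every menu's scores to the longest menu and build the matching masks."""
    width = max(len(m) for m in menus)
    padded = [m + [pad_score] * (width - len(m)) for m in menus]
    masks = [[True] * len(m) + [False] * (width - len(m)) for m in menus]
    return padded, masks


batch = [[1.8, 0.9, -0.4], [0.2, 0.1, 0.0, -0.3, 2.5]]  # a 3-row and a 5-row menu
padded, masks = pad_batch(batch)
probs = [masked_softmax(s, m) for s, m in zip(padded, masks)]
print([round(p, 2) for p in probs[0]])  # [0.66, 0.27, 0.07, 0.0, 0.0]
```

The padded rows of the short menu receive exactly zero probability, so the first three entries match the unpadded softmax.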

And then the deeper trick: deleting a legal action

Masking illegal actions is table stakes. The interesting move — documented in docs/regression.md — was masking an action that is perfectly legal but provably never right.

War story

A regression audit found the current default policy, mc9, putting 42% of its probability mass on joint actions that click a single-target damaging move which is type-immune against every foe — Close Combat into a Ghost, Earthquake into an all-Flying field. The move lands on no one. It is a do-nothing click, strictly worse than anything else on the menu. And it wasn't a regression: every checkpoint back to pre-search mc5 (31%) had the same blind spot, because the situation is rare and the reward signal never punished it hard enough to train away.

The fix wasn't more training. Policy._legal_mask in trainer.py now drops any such action from the legal set at inference time, exactly like an illegal one. The check can never empty a menu, because a switch or pass is always non-immune. Result: mc9's mass on immune hits went from 42% to 0% with zero retraining, and the deleted probability renormalized onto useful moves, which even nudged the strategy floors up. Making typeEff==0 trustworthy also required one exception: Scrappy lets Normal and Fighting moves hit Ghosts, so bridge.mjs accounts for it.
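
The rule can be sketched as a filter over joint actions. This is not the real Policy._legal_mask: the dictionary fields (kind, damaging, single_target, type_eff_vs_foes) are hypothetical stand-ins for whatever the bridge actually provides, and the final fallback is a defensive addition of this sketch rather than something the text describes.

```python
def is_dead_click(slot):
    """True for a single-target damaging move that every foe is immune to."""
    effs = slot.get("type_eff_vs_foes", [])
    return (
        slot["kind"] == "move"
        and slot.get("damaging", False)
        and slot.get("single_target", False)
        and len(effs) > 0
        and all(eff == 0 for eff in effs)
    )


def legal_mask(menu):
    """Keep-mask over joint actions: False for any action containing a dead click."""
    mask = [not any(is_dead_click(slot) for slot in action) for action in menu]
    # Defensive fallback (an assumption of this sketch): never return an empty menu.
    return mask if any(mask) else [True] * len(menu)


# Hypothetical slot choices against two Ghost-type foes.
protect = {"kind": "move", "damaging": False, "single_target": False}
rock_slide = {"kind": "move", "damaging": True, "single_target": False,
              "type_eff_vs_foes": [1.0, 1.0]}
close_combat = {"kind": "move", "damaging": True, "single_target": True,
                "type_eff_vs_foes": [0, 0]}
switch = {"kind": "switch"}

menu = [[rock_slide, protect], [close_combat, protect], [switch, protect]]
print(legal_mask(menu))  # [True, False, True]
```

Feeding this mask into the softmax is what moves the deleted probability onto the surviving rows without touching a single weight.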

In plain terms

This is input validation versus anomaly detection. You could train a model to learn that malformed records are bad — or you could add a constraint and make malformed records impossible. If you can prove an action is never correct, delete it from the menu. Don't spend training budget hoping the model discovers a rule you already know. And because future self-play collects data through the same mask, the net gradually stops wanting the move at all — the constraint propagates backward into the weights for free.

The value head: a second opinion off the same trunk

Branching off the same 128-dim h is one more output: a single Linear(128 → 1) called the value head. It predicts the expected game result from this position — roughly, a number near +1 when the side to move is winning and near −1 when it's losing. One trunk, two heads: the policy head answers "what should I do?", the value head answers "how am I doing?".
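
As a sketch, the value head is nothing more than a dot product with h plus a bias. The code below uses random placeholder weights and hypothetical helper names; the only facts taken from the text are the Linear(128 → 1) shape and that the head reads h alone.

```python
import random

H_DIM = 128


def make_value_head(rng):
    """Random (weights, bias) standing in for the trained Linear(128 -> 1)."""
    weights = [rng.uniform(-0.05, 0.05) for _ in range(H_DIM)]
    bias = 0.0
    return weights, bias


def value(head, h):
    """Single scalar estimate from h only; no action features are involved."""
    weights, bias = head
    return sum(w * v for w, v in zip(weights, h)) + bias


rng = random.Random(0)
head = make_value_head(rng)
h = [rng.uniform(-1, 1) for _ in range(H_DIM)]  # placeholder position summary

n_params = len(head[0]) + 1  # 128 weights + 1 bias
print(n_params)  # 129
estimate = value(head, h)
```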

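It looks like an afterthought (129 extra parameters), but it becomes load-bearing three times over:
  • It is the baseline GAE uses to assign credit across a game's decisions (Chapter 7).
  • It is the leaf evaluator that search calls to judge positions one turn ahead (Chapter 9).
  • Its accuracy is the ceiling on both: per roadmap §3, its noise is likely what capped how deep and wide search could usefully go.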

One action, two Pokémon

In doubles, you command two Pokémon per turn. The net does not pick for each slot separately — one "action" is a joint choice for both active slots at once. The bridge enumerates the Cartesian product of each slot's options (every move with every legal target, ally-target variants, a +mega variant on a mega-capable slot, every legal switch), de-duplicates switch targets, caps at one mega per turn, and caps the whole menu at 400.
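
The enumeration can be sketched with itertools.product. The option dictionaries and the joint_actions helper are hypothetical, and simply truncating at the cap is an assumption: the text says the menu is capped at 400 but not how bridge.mjs decides which rows survive.

```python
from itertools import product

MENU_CAP = 400


def joint_actions(slot_a, slot_b, cap=MENU_CAP):
    """Pair every slot-A option with every slot-B option, minus dedupes and caps."""
    menu = []
    for a, b in product(slot_a, slot_b):
        both_switch_same = (a["kind"] == "switch" and b["kind"] == "switch"
                            and a["target"] == b["target"])
        both_mega = a.get("mega", False) and b.get("mega", False)
        if both_switch_same or both_mega:
            continue  # same bench mon twice, or two megas in one turn
        menu.append((a, b))
    return menu[:cap]  # ASSUMED: plain truncation at the cap


slot_a = [
    {"kind": "move", "name": "Fake Out", "target": "foe-a"},
    {"kind": "move", "name": "Icy Wind", "target": "foes"},
    {"kind": "move", "name": "Icy Wind", "target": "foes", "mega": True},
    {"kind": "switch", "target": "Rillaboom"},
]
slot_b = [
    {"kind": "move", "name": "Rock Slide", "target": "foes"},
    {"kind": "move", "name": "Rock Slide", "target": "foes", "mega": True},
    {"kind": "move", "name": "Protect", "target": "self"},
    {"kind": "switch", "target": "Rillaboom"},
]

menu = joint_actions(slot_a, slot_b)
print(len(menu))  # 14: 16 pairs minus one double-mega and one double-switch
```

Each surviving pair becomes one row for the scorer, so the net sees combinations rather than two separate per-slot lists.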

Why joint? Because doubles is a coordination game. Focus-firing both attacks into one foe to guarantee a KO, or Fake-Outing the threat so your partner can set up safely — the value of slot A's choice depends entirely on slot B's. A net that scored the slots independently couldn't represent "these two clicks are great together." Scoring the pair as one unit lets it learn combinations — including the anti-combination it took an explicit allyHit feature to teach: don't Earthquake your own partner.

Team preview slots into the same machinery as just another decision node: 90 options (which 4 of your 6, and which 2 lead), each scored by the same scorer using the 4 chosen mons' species embeddings. That is why, as of Step C, the policy is the draft recommender — the pick-4 was never a separate model, just a bigger menu.
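
The figure of 90 follows from counting: choose 4 of 6, then choose which 2 of those 4 lead. A short check in Python, assuming (as the count implies) that the left/right order of the two leads is not a separate option; the roster names are placeholders.

```python
from itertools import combinations


def preview_options(roster):
    """Every (picked four, lead pair) choice available at team preview."""
    options = []
    for picked in combinations(roster, 4):
        for leads in combinations(picked, 2):
            options.append((picked, leads))
    return options


roster = ["mon1", "mon2", "mon3", "mon4", "mon5", "mon6"]  # placeholder names
options = preview_options(roster)
print(len(options))  # 90 = C(6,4) * C(4,2) = 15 * 6
```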

[Figure: joint-action grid, slot A's options on one axis and slot B's on the other. Option labels shown: Rock Slide→foes, Protect, switch→Rillaboom, Fake Out→foe-a, Icy Wind→foes +mega, switch→Rillaboom, Sucker Punch→foe-b. Each cell is one joint action; highlighted cells are labelled "focus-fire" and "safe: flinch + cover", and the cell where both slots switch to Rillaboom is crossed out ("deduped: same switch target"). Every surviving cell is one row on the scorer's menu. Full product with all targets/variants: capped at 400, one mega per turn.]
One action = one cell of this grid. Slot A's choices × slot B's choices, minus dedupes and caps. Scoring cells (pairs) instead of rows and columns separately is what lets the net learn coordination — Fake Out to buy the partner a free Protect-less turn, and never both switching into the same bench mon.

Worked example: the masked softmax in numbers

Here is the whole trick on four candidate actions. The scorer emits raw scores; the mask deletes the immune hit before the softmax ever sees it; the rest renormalize.

CANDIDATE ACTION                        SCORE   e^SCORE   PROBABILITY
Rock Slide→foes + Protect                 1.8      6.05   66%
Fake Out→foe-a + Spore→foe-b              0.9      2.46   27%
switch→Rillaboom + Protect               -0.4      0.67   7%
Close Combat→foe-a (Ghost) + Protect      1.2         ✕   masked: typeEff = 0

The masked action's score never enters the sum (6.05 + 2.46 + 0.67 = 9.18); the three survivors split 100%. Note the mask fires even though the net scored the immune hit at 1.2, which is high.

Masked softmax, worked. Exponentiate the surviving scores, divide each by their sum. The immune hit into a Ghost is legal in the game but strictly dominated, so _legal_mask deletes it before the softmax — its 1.2 score, which the net genuinely believed in, simply stops mattering. (Scores illustrative; the mechanism and the mask rule are the real ones.)

The delightful part: it all runs on a CPU

Everything above — four embedding tables, two 128-wide encoder layers, a 64-wide scorer, a 1-wide value head — is a genuinely tiny network. It trains and plays on a plain CPU, and the docs are explicit that a GPU is actually slower here. The bottleneck was never the math inside the net; it's the environment. Every decision requires Showdown to simulate a turn of Pokémon in a Node process, and that dwarfs the network's microseconds of arithmetic. Batches are small (48 games' worth of decisions per round), so shipping tensors to a GPU and back costs more than the compute it saves.

Key point

"ML needs GPUs" is a claim about big models on big batches. When the simulator is the cost and the model is small, the CPU wins. Profile the pipeline, not the folklore — the same instinct that tells you not to spin up a Spark cluster for a 200 MB join.

Where this leaves us

You now have the full anatomy: an encoder that digests the position into h, a scorer that points at menu items, a mask that makes bad clicks impossible, and a value head quietly estimating who's winning. What you don't have yet is any idea how the weights inside those layers get good. That's the next chapter: the training loop, PPO, and the spectacular collapse that taught this project how self-play goes wrong.

Pointer network algorithm

Score each item of a variable-length menu with one shared function of (position summary, item), instead of owning fixed output columns.

src/selfplay/trainer.py · Policy

Masked softmax mechanic

Softmax over legal rows only; illegal and provably-dominated rows get −10⁹ and vanish. Legality is enforced, never learned.

trainer.py · forward / _legal_mask

Value head algorithm

Linear(128→1) off the trunk: "expected result from here." Baseline for GAE (ch. 7), leaf evaluator for search (ch. 9), and the ceiling on both.

trainer.py · Policy.value / batch_value

Joint action mechanic

One action = both slots' choices at once — the Cartesian product, deduped, one mega/turn, capped at 400. Coordination becomes learnable.

src/selfplay/bridge.mjs

Check yourself

Why can't this policy be a normal classifier with one output neuron per action?

Because the action set changes every turn — 3 options one decision, up to 400 the next, 90 at preview — and slot #17 means a different thing each time. A fixed output column would have no stable concept to learn. The pointer network instead learns one scoring function and applies it per menu item, so the menu can be any length.

The immune-hit fix improved the deployed policy from 42% wasted mass to 0% without touching the weights. How?

Policy._legal_mask removes any single-target damaging move that is type-immune against every foe from the legal set at inference, exactly like an illegal action. The softmax then renormalizes the surviving actions, so the deleted probability flows onto useful moves. The weights are unchanged; only the menu shrank. If you can prove an action is never right, delete it — don't retrain and hope.

The value head is 129 parameters. Name two jobs it does beyond "nice to know."

It is the baseline GAE uses to assign credit across a game's ~20 decisions (Chapter 7), and it is the leaf evaluator depth-1 search calls to judge positions one turn ahead (Chapter 9). Per roadmap §3, its noise is likely what capped how deep and wide search could usefully go — the smallest head carries the largest downstream load.