Chapter 5 · Part II — The learner
The brain
Chapter 4 turned a battle into numbers. This chapter is about the neural network that reads those numbers and picks a move — and about the one design constraint that dictated its entire shape: the menu of legal actions changes every single turn.
- Why a normal fixed-output classifier can't play this game, and how a pointer network sidesteps the problem
- What the masked softmax does — and why deleting a provably-bad action from the menu beat retraining
- What the value head is, and why this tiny extra output ends up mattering in three later chapters
- Why one "action" covers both Pokémon at once, and why this whole thing runs faster on a CPU than a GPU
The problem: there is no fixed keyboard
Most neural networks you've heard of are classifiers with a fixed set of outputs. An image model has one output per label: cat, dog, truck. The output layer is a fixed-width table — the same columns every time, forever. Training means learning which column to light up.
This game refuses to fit that shape. On one turn the action space — the set of legal choices — might be 3 forced options. Two turns later it might be 250 combinations of moves, targets, and switches. At team preview it's exactly 90 draft choices. Worse, the choices have no stable identity: "action #17" on turn 3 might be Fake Out into the left foe, and on turn 7 it might be switch to Rillaboom. A fixed output column for "action #17" would be learning nothing coherent.
A classifier is a table with a fixed schema: same columns, every row. What this game hands you is a query result whose row count and row contents change every call. You can't score that with a fixed-width output table. You need a scoring function you apply per row — SELECT action, score(context, action) FROM legal_actions — where score is the same learned function no matter how many rows come back.
The answer: a pointer network
The network in src/selfplay/trainer.py (class Policy) is a pointer network: instead of owning a fixed keyboard of outputs, it points at items on a menu. It works in three steps, repeated at every decision.
Step 1 — Encode the situation into one vector
Everything the net knows about the current position gets concatenated into one long input: the 114-dim state vector from Chapter 4, plus the identity embeddings — 16 species embeddings (the 4 active Pokémon plus both full 6-mon rosters, which Open Team Sheets reveals at preview), and the 4 active mons' item and ability embeddings. That whole bundle (434 numbers) runs through two 128-wide layers and comes out as a single 128-dim vector called h.
Think of h as the position, digested: one compact summary of "who's on the field, who's healthy, who's fast, what's the weather, what's on both benches." It's computed once per decision and reused for everything that follows.
Step 2 — Score each candidate action independently
Now the menu arrives: the bridge hands over every legal action, each carrying its own feature row — a 30-dim block describing the two slots' choices (is it a move or a switch, base power, type effectiveness, priority, PP, the calc oracle's "does this KO?" bits, ally-hit risk, flinch chance, mega flag, charge-turn flag), plus that action's own embeddings: up to 4 species ids (switch targets, or the 4 drafted mons at preview) and 2 move ids.
For each candidate, the net concatenates h with that action's features and embeddings and pushes the bundle through a small scorer (a 64-wide layer, then a single output). Out comes one number per action: a raw score. Same scorer, same weights, applied to 3 rows or 250 rows — it doesn't care how long the menu is.
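A minimal sketch of the "same scorer, any menu length" idea, in plain Python rather than the project's actual tensor code. The names make_scorer, linear and relu are hypothetical, the weights are random, the ReLU activation is an assumption (the text does not name one), and the action row is cut down to the 30 battle features with no embeddings.

```python
import random

def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, a) for a in v]

def make_scorer(in_dim, hidden=64, seed=0):
    """Build one scorer: a 64-wide hidden layer, then a single output.
    Random weights stand in for learned ones (illustration only)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)]]
    b2 = [0.0]

    def score(h, action_row):
        x = h + action_row  # concatenate the position digest with one action's features
        return linear(w2, b2, relu(linear(w1, b1, x)))[0]

    return score

H_DIM, ACT_DIM = 128, 30  # ACT_DIM omits the per-action embeddings for brevity
score = make_scorer(H_DIM + ACT_DIM)

rng = random.Random(1)
h = [rng.uniform(-1, 1) for _ in range(H_DIM)]  # computed once per decision

def random_menu(n):
    return [[rng.uniform(-1, 1) for _ in range(ACT_DIM)] for _ in range(n)]

short_menu, long_menu = random_menu(3), random_menu(250)

# The same function and the same weights score both menus, one row at a time.
short_scores = [score(h, row) for row in short_menu]
long_scores = [score(h, row) for row in long_menu]
print(len(short_scores), len(long_scores))
```

Nothing in score depends on the menu length, which is the whole point: the number of outputs is decided by the caller's loop, not by the network's shape.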
Step 3 — Softmax over the legal menu only
The raw scores go through a masked softmax: exponentiate each score, divide by the sum, and you get probabilities that sum to 1 over exactly the current menu. A score of 1.8 next to a 0.9 and a −0.4 becomes 66% / 27% / 7%. That probability distribution is the policy's answer: during training it samples from it; at play time it can take the argmax.
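The arithmetic in that example can be checked in a few lines. This is a plain softmax over the three illustrative scores from the text; the sampling-versus-argmax split at the end mirrors the training and play-time behaviour described above, with a seeded generator so the sketch is repeatable.

```python
import math
import random

def softmax(scores):
    """Exponentiate and normalize; subtracting the max keeps exp() well-behaved."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.8, 0.9, -0.4])
percent = [round(100 * p) for p in probs]
print(percent)  # [66, 27, 7]

# Training: sample an index from the distribution.
rng = random.Random(0)
sampled = rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Play time: take the highest-probability action.
greedy = max(range(len(probs)), key=lambda i: probs[i])
```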
Figure: the encoder digests the position into h; a single shared scorer is applied per candidate action (a team-preview pick routes through the same scorer on its own turn); a masked softmax turns scores into a probability menu. Scores and percentages here are illustrative; the architecture and dimensions are the real ones from trainer.py.
The network never owns a fixed set of outputs. It owns two learned functions, "digest the position" and "score one candidate given the digest", and applies the second one per menu item. That's why the same net can handle 3 forced options, 250 joint battle actions, and the 90-way draft at preview without changing shape.
The action feature row is actually 60-wide today: dims 0–29 are the battle features described above, and dims 30–59 are team-preview matchup features, kept deliberately disjoint. An earlier version shared the dims — and the preview gradients (noisy, 90-way, fired once per game) directly interfered with the battle scorer's most load-bearing weights (the KO features), costing about 70 iterations of early learning progress. Separate columns for separate concerns, like not letting a batch backfill job write to the columns your live queries depend on. See the comment above AFEAT in src/selfplay/trainer.py.
The mask: illegal actions never exist
Before the softmax runs, a boolean mask marks which rows are real. Padded rows (the menu is padded to the batch's longest menu) and anything not offered by the simulator get their score overwritten with −10⁹ — effectively negative infinity — so the softmax assigns them probability zero. The net never wastes a single gradient learning "don't click what you can't click." Legality is enforced by construction, not learned by punishment.
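The padding-and-mask step can be sketched as follows. This illustrates the mechanism only and is not the code in trainer.py; pad_batch and masked_softmax are hypothetical names, and the scores are made up.

```python
import math

NEG = -1e9  # stand-in for negative infinity, as in the text

def pad_batch(menus, pad_score=0.0):
    """Pad every menu to the longest one and record which rows are real."""
    width = max(len(m) for m in menus)
    scores = [m + [pad_score] * (width - len(m)) for m in menus]
    mask = [[True] * len(m) + [False] * (width - len(m)) for m in menus]
    return scores, mask

def masked_softmax(scores, mask):
    """Overwrite masked-out scores with NEG, then take an ordinary softmax."""
    filled = [s if keep else NEG for s, keep in zip(scores, mask)]
    m = max(filled)
    exps = [math.exp(s - m) for s in filled]  # exp(-1e9) underflows to exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Two decisions in one batch: a 3-row menu and a 5-row menu.
menus = [[1.8, 0.9, -0.4], [0.2, 0.1, 0.0, -0.3, 0.5]]
scores, mask = pad_batch(menus)
probs = [masked_softmax(s, k) for s, k in zip(scores, mask)]

print(probs[0])  # last two entries are 0.0: the padded rows
```

Because the padded rows receive exactly zero probability, they contribute nothing to a sampled action or to the loss, whatever pad_score happens to be.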
And then the deeper trick: deleting a legal action
Masking illegal actions is table stakes. The interesting move — documented in docs/regression.md — was masking an action that is perfectly legal but provably never right.
A regression audit found the current default policy, mc9, putting 42% of its probability mass on joint actions that click a single-target damaging move which is type-immune against every foe — Close Combat into a Ghost, Earthquake into an all-Flying field. The move lands on no one. It is a do-nothing click, strictly worse than anything else on the menu. And it wasn't a regression: every checkpoint back to pre-search mc5 (31%) had the same blind spot, because the situation is rare and the reward signal never punished it hard enough to train away.
The fix wasn't more training. Policy._legal_mask in trainer.py now drops any such action from the legal set at inference time, exactly like an illegal one. The filter cannot empty a menu, because a switch or pass is always non-immune. Result: mc9's mass on immune hits went 42% → 0% with zero retraining, and the deleted probability renormalized onto useful moves, which even nudged the strategy floors up. The check that makes typeEff==0 trustworthy also had to learn one exception: Scrappy lets Normal and Fighting moves hit Ghosts, so bridge.mjs accounts for it.
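A rough sketch of that rule, under stated assumptions. The real logic lives in Policy._legal_mask and reads the bridge's features; here SlotChoice, is_dead_click and legal_mask are hypothetical names, the per-foe effectiveness tuple is a simplified stand-in for the typeEff feature (assumed to already include the Scrappy exception), and the fallback that keeps the full menu if everything would be dropped is a defensive assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class SlotChoice:
    label: str
    single_target_damaging: bool = False
    # Type-effectiveness multiplier against each foe on the field (hypothetical field).
    type_eff_vs_foes: Tuple[float, ...] = ()

def is_dead_click(choice: SlotChoice) -> bool:
    """A single-target damaging move that is immune against every foe."""
    return (choice.single_target_damaging
            and len(choice.type_eff_vs_foes) > 0
            and all(eff == 0 for eff in choice.type_eff_vs_foes))

def legal_mask(menu: List[Tuple[SlotChoice, SlotChoice]]) -> List[bool]:
    """True = keep. A joint action is dropped if either slot makes a dead click."""
    keep = [not any(is_dead_click(c) for c in joint) for joint in menu]
    # Never return an empty menu (assumption: fall back to the unfiltered one).
    return keep if any(keep) else [True] * len(menu)

close_combat = SlotChoice("Close Combat", True, (0.0, 0.0))  # both foes are Ghosts
knock_off = SlotChoice("Knock Off", True, (2.0, 2.0))
protect = SlotChoice("Protect")
switch = SlotChoice("switch to Rillaboom")

menu = [(close_combat, protect), (knock_off, protect),
        (switch, knock_off), (close_combat, knock_off)]
print(legal_mask(menu))  # [False, True, True, False]
```

The mask produced here would be fed to the softmax in the same way as the padding mask, so the dropped rows end up with probability zero.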
This is input validation versus anomaly detection. You could train a model to learn that malformed records are bad — or you could add a constraint and make malformed records impossible. If you can prove an action is never correct, delete it from the menu. Don't spend training budget hoping the model discovers a rule you already know. And because future self-play collects data through the same mask, the net gradually stops wanting the move at all — the constraint propagates backward into the weights for free.
The value head: a second opinion off the same trunk
Branching off the same 128-dim h is one more output: a single Linear(128 → 1) called the value head. It predicts the expected game result from this position — roughly, a number near +1 when the side to move is winning and near −1 when it's losing. One trunk, two heads: the policy head answers "what should I do?", the value head answers "how am I doing?".
It looks like an afterthought — 129 extra parameters — but it becomes load-bearing three times over:
- In Chapter 7 it's the baseline that GAE uses to decide which of a game's ~20 decisions deserved the win.
- In Chapter 9 it's the leaf evaluator for depth-1 search: search plays out one turn and asks the value head "who's ahead now?" at every leaf.
- And its quality is a ceiling: the project's roadmap (§3, "Leaf-value quality is what capped search depth") concludes that this one small head — trained only as a side objective inside PPO — is probably what limited deeper search, wider search, and teacher quality all at once. The whole search program leans on 129 numbers.
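To make the "129 numbers" concrete, here is a hedged sketch of what a Linear(128 → 1) head computes: a dot product of h with 128 weights, plus one bias. The weights below are random placeholders and the variable names are hypothetical; whether the real head applies any squashing to its output is not stated in the text, so none is applied here.

```python
import random

H_DIM = 128
rng = random.Random(0)

# Placeholder parameters: 128 weights and 1 bias.
value_w = [rng.uniform(-0.05, 0.05) for _ in range(H_DIM)]
value_b = 0.0

def value(h):
    """Expected-result estimate read straight off the shared trunk output h."""
    return sum(w * x for w, x in zip(value_w, h)) + value_b

n_params = len(value_w) + 1  # weights + bias
h = [rng.uniform(-1, 1) for _ in range(H_DIM)]

print(n_params)   # 129
print(value(h))   # one scalar per position
```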
One action, two Pokémon
In doubles, you command two Pokémon per turn. The net does not pick for each slot separately — one "action" is a joint choice for both active slots at once. The bridge enumerates the Cartesian product of each slot's options (every move with every legal target, ally-target variants, a +mega variant on a mega-capable slot, every legal switch), de-duplicates switch targets, caps at one mega per turn, and caps the whole menu at 400.
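The enumeration can be sketched like this. The real version is JavaScript in bridge.mjs; this Python illustration uses hypothetical names (Choice, joint_actions), reads "de-duplicates switch targets" as "both slots may not switch to the same benched Pokémon", and simply truncates in enumeration order at the cap. Both of those readings are assumptions.

```python
from itertools import product
from typing import List, NamedTuple, Optional, Tuple

class Choice(NamedTuple):
    label: str
    switch_to: Optional[str] = None  # bench target, if this choice is a switch
    mega: bool = False               # +mega variant of a move

MENU_CAP = 400

def joint_actions(slot_a: List[Choice], slot_b: List[Choice],
                  cap: int = MENU_CAP) -> List[Tuple[Choice, Choice]]:
    out = []
    for a, b in product(slot_a, slot_b):  # Cartesian product of per-slot options
        if a.switch_to is not None and a.switch_to == b.switch_to:
            continue  # both slots cannot bring in the same Pokémon
        if a.mega and b.mega:
            continue  # at most one mega per turn
        out.append((a, b))
        if len(out) == cap:
            break     # hard cap on menu size
    return out

def slot_options(moves, targets, bench, can_mega):
    opts = [Choice(f"{m} -> {t}") for m in moves for t in targets]
    if can_mega:
        opts += [Choice(f"{m} -> {t} +mega", mega=True) for m in moves for t in targets]
    opts += [Choice(f"switch {s}", switch_to=s) for s in bench]
    return opts

targets, bench = ["foe L", "foe R"], ["bench 1", "bench 2"]
slot_a = slot_options(["move 1", "move 2"], targets, bench, can_mega=True)
slot_b = slot_options(["move 3", "move 4"], targets, bench, can_mega=True)

menu = joint_actions(slot_a, slot_b)
print(len(slot_a), len(slot_b), len(menu))  # 10 10 82
```

Ten options per slot give 100 raw pairs; removing the 2 same-target double switches and the 16 double-mega pairs leaves 82, which shows how quickly the joint menu grows even for a tiny toy move set.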
Why joint? Because doubles is a coordination game. Focus-firing both attacks into one foe to guarantee a KO, or Fake-Outing the threat so your partner can set up safely — the value of slot A's choice depends entirely on slot B's. A net that scored the slots independently couldn't represent "these two clicks are great together." Scoring the pair as one unit lets it learn combinations — including the anti-combination it took an explicit allyHit feature to teach: don't Earthquake your own partner.
Team preview slots into the same machinery as just another decision node: 90 options (which 4 of your 6, and which 2 lead), each scored by the same scorer using the 4 chosen mons' species embeddings. That is why, as of Step C, the policy is the draft recommender — the pick-4 was never a separate model, just a bigger menu.
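The count of 90 follows from the combinatorics: 15 ways to choose 4 of 6, times 6 ways to choose 2 leads among those 4. A short enumeration confirms it; the roster names are placeholders.

```python
from itertools import combinations

roster = ["mon1", "mon2", "mon3", "mon4", "mon5", "mon6"]  # placeholder names

# Each option: (the 4 Pokémon brought, the 2 of those that lead).
options = [(team, leads)
           for team in combinations(roster, 4)
           for leads in combinations(team, 2)]

print(len(options))  # 90
```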
Worked example: the masked softmax in numbers
Here is the whole trick on four candidate actions. The scorer emits raw scores; the mask deletes the immune hit before the softmax ever sees it; the rest renormalize.
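The same example can be reproduced numerically. Only the 1.2 score on the immune hit comes from the original illustration; the other three scores and all labels are made up for this sketch.

```python
import math

NEG = -1e9

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["immune hit", "useful move A", "useful move B", "switch"]
raw = [1.2, 0.8, 0.1, -0.5]        # raw scorer outputs (illustrative)
keep = [False, True, True, True]   # the mask drops the immune hit

before = softmax(raw)                                           # what the net "wanted"
after = softmax([s if k else NEG for s, k in zip(raw, keep)])   # what actually gets played

for name, b, a in zip(labels, before, after):
    print(f"{name:14s} {100 * b:5.1f}% -> {100 * a:5.1f}%")
```

With these numbers the immune hit falls from roughly 46% to 0%, and its share is redistributed across the three surviving actions in proportion to their existing weights.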
Figure: _legal_mask deletes the immune hit before the softmax; its 1.2 score, which the net genuinely believed in, simply stops mattering. (Scores illustrative; the mechanism and the mask rule are the real ones.)
The delightful part: it all runs on a CPU
Everything above — four embedding tables, two 128-wide encoder layers, a 64-wide scorer, a 1-wide value head — is a genuinely tiny network. It trains and plays on a plain CPU, and the docs are explicit that a GPU is actually slower here. The bottleneck was never the math inside the net; it's the environment. Every decision requires Showdown to simulate a turn of Pokémon in a Node process, and that dwarfs the network's microseconds of arithmetic. Batches are small (48 games' worth of decisions per round), so shipping tensors to a GPU and back costs more than the compute it saves.
"ML needs GPUs" is a claim about big models on big batches. When the simulator is the cost and the model is small, the CPU wins. Profile the pipeline, not the folklore — the same instinct that tells you not to spin up a Spark cluster for a 200 MB join.
Where this leaves us
You now have the full anatomy: an encoder that digests the position into h, a scorer that points at menu items, a mask that makes bad clicks impossible, and a value head quietly estimating who's winning. What you don't have yet is any idea how the weights inside those layers get good. That's the next chapter: the training loop, PPO, and the spectacular collapse that taught this project how self-play goes wrong.
Pointer network · algorithm
Score each item of a variable-length menu with one shared function of (position summary, item), instead of owning fixed output columns.
src/selfplay/trainer.py · Policy
Masked softmax · mechanic
Softmax over legal rows only; illegal and provably-dominated rows get −10⁹ and vanish. Legality is enforced, never learned.
trainer.py · forward / _legal_mask
Value head · algorithm
Linear(128→1) off the trunk: "expected result from here." Baseline for GAE (ch. 7), leaf evaluator for search (ch. 9), and the ceiling on both.
trainer.py · Policy.value / batch_value
Joint action · mechanic
One action = both slots' choices at once: the Cartesian product, deduped, one mega/turn, capped at 400. Coordination becomes learnable.
src/selfplay/bridge.mjs
Why can't this policy be a normal classifier with one output neuron per action?
Because the action set changes every turn (3 options at one decision, up to 400 at the next, 90 at preview) and action #17 means a different thing each time. A fixed output column would have no stable concept to learn. The pointer network instead learns one scoring function and applies it per menu item, so the menu can be any length.
The immune-hit fix improved the deployed policy from 42% wasted mass to 0% without touching the weights. How?
Policy._legal_mask removes any single-target damaging move that is type-immune against every foe from the legal set at inference, exactly like an illegal action. The softmax then renormalizes the surviving actions, so the deleted probability flows onto useful moves. The weights are unchanged; only the menu shrank. If you can prove an action is never right, delete it — don't retrain and hope.
The value head is 129 parameters. Name two jobs it does beyond "nice to know."
It is the baseline GAE uses to assign credit across a game's ~20 decisions (Chapter 7), and it is the leaf evaluator depth-1 search calls to judge positions one turn ahead (Chapter 9). Per roadmap §3, its noise is likely what capped how deep and wide search could usefully go — the smallest head carries the largest downstream load.