Reference

Glossary

Every term the course uses, one card each. Each entry gives the general meaning first; the text after "Here:" then gives the concrete instance in this project, with a real number or a real file wherever one exists.

The game

Open Team Sheets mechanic

A tournament rule making both players' full teams — species, items, moves — visible from team preview on. Here: it makes Reg M-B near full-information, which is why the policy embeds both 6-mon rosters and why search needs no belief-tracking over hidden teams.

Trick Room mechanic

A move that reverses the speed order for five turns, so slow Pokémon move first — a whole team archetype is built around it. Here: the project's canary. ExIt ground its usage from 12–13% to 0.8% while win-rate rose; the gate now floors it at mean probability ≥ 4% on dedicated TR teams.

Archetype data

A team's strategic plan class — the win condition it is built around. Here: every corpus team is tagged in replica_team_labels with priority weather > semi_room > trick_room > tailwind_ho > balance; the gate's floors and archetype_bench.py both key off these tags.
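The priority ordering can be read as a first-match rule. The sketch below is a minimal illustration of that idea, not the project's tagging code: `pick_archetype` is a hypothetical name, the input is assumed to be a set of candidate tags already detected for a team, and falling back to `balance` when nothing matches is also an assumption.

```python
# Priority order copied from the glossary entry; everything else is illustrative.
ARCHETYPE_PRIORITY = ["weather", "semi_room", "trick_room", "tailwind_ho", "balance"]


def pick_archetype(candidate_tags):
    """Return the highest-priority tag present in candidate_tags."""
    for tag in ARCHETYPE_PRIORITY:
        if tag in candidate_tags:
            return tag
    return "balance"  # assumed fallback when no tag was detected


print(pick_archetype({"trick_room", "weather"}))  # weather outranks trick_room
print(pick_archetype({"tailwind_ho"}))
```

A team that qualifies for several plans therefore gets exactly one label, which is what lets per-archetype floors and benchmarks key off a single tag.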

Stage game algorithm

One simultaneous-move turn modeled as a matrix game: my actions × your actions, each cell an outcome. Sequential minimax is wrong here because it lets the second mover see the first mover's choice, which leaks your move to the opponent. Here: search.py prunes to top-k actions per side and solves the k×k zero-sum matrix for a mixed strategy.
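To make "solves the k×k zero-sum matrix for a mixed strategy" concrete, here is a small sketch that uses fictitious play, a classic iterative method whose empirical action frequencies converge to an equilibrium in zero-sum games. The solver that search.py actually uses is not stated in this glossary, so the technique, the function name `solve_zero_sum`, and the iteration count are all assumptions made for illustration.

```python
def solve_zero_sum(payoff, iters=20000):
    """Approximate equilibrium mixes for a zero-sum matrix game.

    payoff[i][j] is the row player's payoff when row plays i and column
    plays j; the column player receives the negative of it.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    row_totals = [0.0] * n_rows  # what each row action would have earned so far
    col_totals = [0.0] * n_cols  # what each column action would have conceded so far
    i, j = 0, 0
    for _ in range(iters):
        row_counts[i] += 1
        col_counts[j] += 1
        for r in range(n_rows):
            row_totals[r] += payoff[r][j]
        for c in range(n_cols):
            col_totals[c] += payoff[i][c]
        # Each side best-responds to the opponent's play history.
        i = max(range(n_rows), key=row_totals.__getitem__)
        j = min(range(n_cols), key=col_totals.__getitem__)
    row_mix = [count / iters for count in row_counts]
    col_mix = [count / iters for count in col_counts]
    value = sum(row_mix[r] * payoff[r][c] * col_mix[c]
                for r in range(n_rows) for c in range(n_cols))
    return row_mix, col_mix, value


# Matching pennies: no pure choice is safe, so the answer is a mix near 50/50.
row_mix, col_mix, value = solve_zero_sum([[1, -1], [-1, 1]])
print(row_mix, col_mix, value)
```

The mixed result is the point: any deterministic choice in matching pennies can be exploited, which is the same reason a sequential minimax answer is unsafe in a simultaneous-move turn.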

The environment

Pokémon Showdown tool

The open-source battle simulator the competitive community runs on. Here: it already ships Reg M-B (mod champions, format gen9championsvgc2026regmb), so there was no simulator to build — the project wraps a built clone at ~/repos/pokemon-showdown.

Bridge tool

A translation process that exposes one system to another over a simple protocol. Here: bridge.mjs, a Node process that drives Showdown, hosts N concurrent games, computes every action's feature row, and talks JSON-lines over stdio to Python.

VecPool tool

A vectorized environment: many game instances stepped in lockstep so one batched network call serves them all. Here: vecbridge.py runs 6 bridge processes × 8 games = 48 games per inference call — part of the ~12× speedup (~14–18 s → ~1.2 s per training iteration).

Calc oracle tool

A trusted external calculator consulted at decision time instead of learned. Here: engine/calc.ts, the app's own damage/stat math — it validated Showdown's parity, then became the source of the koFrac/koBit "does this move KO?" features. It transfers to any new metagame for free.

Corpus data

The dataset of real examples a system trains on. Here: 632 real tournament teams scraped into SQLite (260 VGCPastes + 372 op.gg), TeamValidator-filtered to 599 legal teams; the bridge draws an asymmetric pair per game.

Action space mechanic

The set of choices available at each decision. Here: all legal joint two-slot combinations (moves × targets, plus mega variants and switches), capped at 400, plus the 90-way pick-4 + leads choice at team preview. Each action carries a feature row and id channels.
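The 90 in "90-way pick-4 + leads" can be reproduced by counting: 15 ways to bring 4 of 6, times 6 ways to choose which 2 of those 4 lead. The sketch below enumerates them. It assumes the two leads are unordered, and `preview_choices` is a hypothetical name rather than a function from the project.

```python
from itertools import combinations


def preview_choices(roster):
    """Enumerate (leads, back) pairs: pick 4 of 6, then choose which 2 lead."""
    choices = []
    for picked in combinations(roster, 4):
        for leads in combinations(picked, 2):
            back = tuple(mon for mon in picked if mon not in leads)
            choices.append((leads, back))
    return choices


roster = ["A", "B", "C", "D", "E", "F"]  # placeholder names for six team members
print(len(preview_choices(roster)))  # 15 * 6 = 90
```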

Observation data

What the agent sees of the world each step, as numbers. Here: featurize.py builds a 114-dim state vector (HP, status, weather, terrain, speed order, bench health…) plus parallel id channels for species, items, abilities, and moves.

Preview features data

Matchup summaries attached to draft-time actions, so the pick-4 scorer can see the foe directly. Here: PREVIEW_FEATS=1 fills action dims 30–59 with per-lead type-effectiveness, outspeed fraction, and Fake Out/Intimidate exposure; the first ablation made 5/10 teams foe-adaptive vs 0/10 without.

Learning

Policy algorithm

The decision-maker: a function from observation to a probability distribution over actions. Here: the pointer network in trainer.py; the deployed one is policy_wider, saved as checkpoints/policy_default.pt.

Value head algorithm

A second output on the same network estimating "how good is this position?" — expected final reward. Here: a Linear(128,1) off the shared trunk; it bootstraps credit through GAE and serves as the leaf evaluator that caps search quality.

Reward mechanic

The scalar feedback learning optimizes — the only "opinion" the system is given. Here: a sparse terminal ±1 per game (win/loss), densified by PBRS shaping so intermediate progress earns interim credit.

Episode mechanic

One complete run from reset to terminal state. Here: one full battle — about 20 decisions reach the policy, since the bridge auto-plays anything with ≤1 legal option.

PPO algorithm

Proximal Policy Optimization: policy-gradient learning with a clipped update, so each step can only move the policy a bounded distance from the one that collected the data. Here: the training algorithm in trainer.py, run by selfplay_train.py.
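The "clipped update" is easiest to see for a single sampled action. The sketch below computes the standard PPO clipped surrogate term; the clip range of 0.2 is a common default used here as an assumption, since the value in trainer.py is not given in this glossary.

```python
def ppo_clipped_term(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one sample.

    ratio is new_prob / old_prob for the action that was taken;
    advantage says whether that action was better than expected.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to move the ratio
    # further outside the clip range than it already is.
    return min(ratio * advantage, clipped_ratio * advantage)


print(ppo_clipped_term(1.5, 1.0))   # good action, ratio capped at 1.2
print(ppo_clipped_term(0.5, -1.0))  # bad action, ratio floored at 0.8
```

Training maximizes the average of this term over a batch, so once the ratio leaves the clip range in the direction the advantage favors, the gradient for that sample vanishes.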

Self-play algorithm

Training against copies of yourself, so the opponent improves exactly as fast as you do. Here: the learner plays a frozen league of its own past snapshots; the improved player becomes its own next opponent.

Frozen league mechanic

A pool of past policy snapshots, weights frozen, used as training opponents to prevent overfitting to a single rival. Here: the fix for the late win-rate collapse — a stochastic league opponent (plus lr decay and best-by-confirm) ended the brittle-exploit cycle.

GAE algorithm

Generalized Advantage Estimation: blends observed rewards with value-head predictions (parameter λ) to reduce variance in "was this action better than expected?". Here: λ = 0.95, and it was the single biggest strength gain of the project — head-to-head Elo 80 vs 44 without it.
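A compact sketch of the GAE recursion for one finished episode, using λ = 0.95 from the entry above. The discount γ = 1.0 is an assumption (the glossary does not state it), and `gae_advantages` is an illustrative name, not the project's function.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Return one advantage per step for a single finished episode.

    values[t] is the value-head estimate before step t; the value after
    the final step is taken as 0 because the episode has ended.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step surprise: what happened versus what the value head predicted.
        delta = rewards[t] + gamma * next_value - values[t]
        # Blend in later surprises, discounted by gamma * lambda.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Three decisions, a win (+1) at the end, value head guessing 0.5 throughout.
print(gae_advantages([0.0, 0.0, 1.0], [0.5, 0.5, 0.5]))
```

With λ = 1 every step would receive the full Monte Carlo surprise; with λ = 0 each step would see only its own one-step error. Values in between trade variance against bias.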

PBRS algorithm

Potential-based reward shaping: add interim rewards as differences of a potential function Φ over states — provably without changing what the optimal policy is. Here: Φ = HP material + speed control + boosts + screens; worth +19 Elo, and the thing that originally taught Trick Room.
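The reason shaping of this form cannot change the optimal policy is that the bonuses telescope. The sketch below uses the standard form F = γ·Φ(s') − Φ(s), with γ = 1.0 as an assumption and made-up potential values; it is not the project's Φ.

```python
def shaping_rewards(potentials, gamma=1.0):
    """Shaping bonus for each transition: gamma * Phi(next) - Phi(current)."""
    return [gamma * nxt - cur for cur, nxt in zip(potentials, potentials[1:])]


phis = [0.0, 0.3, 0.1, 0.6]  # illustrative Phi values along one trajectory
bonuses = shaping_rewards(phis)
print(bonuses)
# With gamma = 1 the sum equals Phi(last) - Phi(first), whatever happened in between.
print(sum(bonuses))
```

Because the total bonus depends only on the endpoints, the agent gets earlier feedback along the way without any trajectory gaining a lasting advantage from the shaping itself.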

Embedding data

A learned vector of numbers standing in for a categorical id — the network's private notion of what "Incineroar" means. Here: species (vocab 165, dim 16, shared between state and action sides), moves (vocab 332, dim 12), items and abilities (dim 8).

Pointer network algorithm

An architecture that scores each candidate from a variable-size set against a context encoding, instead of having one fixed output per action. Here: essential because the legal action set changes every turn; the net encodes the state to 128 dims, then scores each legal action's features against it.

Masked softmax algorithm

Converting scores to probabilities over only the legal options — illegal ones are masked to zero probability, structurally. Here: the policy can never click an illegal action by construction, and Policy._legal_mask also drops fully type-immune attacks (the immune-hit fix, no retrain needed).
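A minimal pure-Python sketch of a masked softmax. The real policy presumably works on tensors, so the list-based form and the name `masked_softmax` are assumptions chosen to keep the example self-contained.

```python
import math


def masked_softmax(scores, legal):
    """Softmax over the legal entries only; illegal entries get exactly 0."""
    legal_scores = [s for s, ok in zip(scores, legal) if ok]
    if not legal_scores:
        raise ValueError("at least one action must be legal")
    top = max(legal_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(scores, legal)]
    total = sum(exps)
    return [e / total for e in exps]


# The highest raw score belongs to an illegal action and is ignored entirely.
print(masked_softmax([2.0, 5.0, 1.0], [True, False, True]))
```

Extending the mask (for example to drop type-immune attacks) changes which entries are zeroed without touching the learned scores, which is why such a fix needs no retraining.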

Entropy metric

A measure of how spread out a probability distribution is: 0 = all mass on one option, higher = flatter. Here: the preview head's entropy of 0.54 with a top probability of 0.30 diagnosed the foe-blind draft: a broad head whose ordering the foe never changed.
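A short sketch of Shannon entropy, raw and normalized. The glossary does not say which unit the 0.54 figure uses. If 0.30 is the largest single probability, the raw entropy must be at least −ln 0.30 ≈ 1.20 nats, so 0.54 is presumably a normalized value (entropy divided by its maximum, ln n); that reading is an assumption.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def normalized_entropy(probs):
    """Entropy divided by its maximum, ln(n): 0 = one-hot, 1 = uniform."""
    n = len(probs)
    return entropy(probs) / math.log(n) if n > 1 else 0.0


print(entropy([1.0, 0.0, 0.0]))        # all mass on one option
print(normalized_entropy([0.25] * 4))  # perfectly flat
```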

Credit assignment mechanic

The problem of deciding which of many earlier decisions deserves blame or credit for one delayed outcome. Here: one ±1 per ~20 decisions; solved well in battle by PBRS + GAE, but too weak to make the single preview decision foe-sensitive — hence preview features.

Checkpoint data

A saved snapshot of a network's weights, restorable later. Here: the .pt files in src/selfplay/checkpoints/; new saves carry metadata {state_dict, preview_feats, afeat} so each policy is fed the inputs of its own feature era.

Sampling (vs greedy) mechanic

Drawing an action from the policy's distribution, rather than always taking the argmax (greedy). Here: every eval tool samples by default — greedy once hid mc6's entire +22.4 pp gain, and it undersells stochastic setup lines like Trick Room.

Measurement

Elo metric

A rating system from chess: beat stronger opponents, gain more points; ratings predict win probabilities. Here: random is anchored at 0; the deployed policy_wider rates 309, mc7 301, wide 244 (src/evaluate/elo.py). Team-draw variance makes it coarse — ~6% resolution at 300 games.
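Two small formulas sit behind this entry. The sketch below uses the conventional Elo logistic curve on a 400-point scale, which is assumed here because the scale used by src/evaluate/elo.py is not stated, plus the binomial standard error of a measured win-rate.

```python
import math


def elo_expected_score(rating_a, rating_b):
    """Predicted win probability of A over B on the usual 400-point scale."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def win_rate_std_error(p, games):
    """Binomial standard error of a win-rate p measured over `games` games."""
    return math.sqrt(p * (1.0 - p) / games)


print(elo_expected_score(309, 301))   # an 8-point gap is barely above a coin flip
print(win_rate_std_error(0.5, 300))   # roughly 0.029
```

Two standard errors at 300 games come to just under 6 percentage points, which is one plausible reading of the "~6% resolution" figure, and it shows why an 8-point Elo gap is hard to confirm at that sample size.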

Pilot bench tool

A paired benchmark: hold the team, field, seeds, and opponent fixed, vary only the pilot — so team-matchup luck cancels out. Here: src/evaluate/pilot_bench.py, the sharp tool (~1 pp standard error) that resolves the 1–3% piloting edges Elo drowns.

Regression gate tool

An automated pass/fail check that a new version hasn't lost required behavior — CI for strategy. Here: src/evaluate/policy_regression.py: floors on Trick Room (4%), Tailwind (4%), Protect (7%), Fake Out (5%), mega usage (≥80%), plus the blocking immune-hit check. Seeded and deterministic; no checkpoint promotes without passing.
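The gate's logic amounts to a table of floors plus one blocking check. The sketch below is a hypothetical illustration, not the contents of policy_regression.py: the floor values come from this entry, but the dictionary keys, the function name, the treatment of the immune-hit check as a zero-count requirement, and every sample measurement other than Trick Room at 0.8% are invented for the example.

```python
# Floors taken from the glossary entry, expressed as fractions.
FLOORS = {
    "trick_room": 0.04,
    "tailwind": 0.04,
    "protect": 0.07,
    "fake_out": 0.05,
    "mega_usage": 0.80,
}


def gate_failures(measured, immune_hits=0):
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    for name, floor in FLOORS.items():
        value = measured.get(name, 0.0)
        if value < floor:
            failures.append(f"{name}: {value:.3f} is below the floor {floor:.2f}")
    if immune_hits > 0:  # assumed form of the blocking immune-hit check
        failures.append(f"immune_hit: {immune_hits} attacks into an immunity")
    return failures


# Placeholder measurements; only trick_room = 0.008 reflects a figure from the text.
sample = {"trick_room": 0.008, "tailwind": 0.06, "protect": 0.10,
          "fake_out": 0.08, "mega_usage": 0.90}
print(gate_failures(sample))
```

A checkpoint would be promotable only when the returned list is empty, regardless of its win-rate.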

vs-random pitfall

Win-rate against a uniformly random opponent — the cheapest possible strength probe. Here: saturated (~90%) and blind: two checkpoints both read 92% vs-random yet rated Elo 94 vs 37 against each other. Printed each training iteration as a coarse floor, never used for promotion.

Goodhart's law pitfall

"When a measure becomes a target, it ceases to be a good measure" — optimizing a proxy destroys what the proxy stood for. Here: the whole Chapter 10 story: promoting on scalar win-rate alone shipped mc7, a policy that wins more and never sets Trick Room. The gate is the countermeasure.

Search & distillation

Determinization algorithm

Fixing the random seed so one stochastic transition becomes a single concrete sample — then averaging over several seeds to estimate the true expectation. Here: each search cell is averaged over m seeds; widening m (1 → 3 → 5) was half of the "wider teacher" lever.
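A toy sketch of determinization: each seed turns a random outcome into one concrete sample, and averaging over m seeds estimates the expectation. The 80%-accurate move and both function names are stand-ins invented for illustration; the project's search cells evaluate simulated battle turns instead.

```python
import random


def noisy_outcome(rng):
    """Stand-in for one simulated turn: a move that hits 80% of the time."""
    return 1.0 if rng.random() < 0.8 else 0.0


def determinized_value(simulate, seeds):
    """Average one stochastic evaluation over a fixed list of seeds."""
    seeds = list(seeds)
    return sum(simulate(random.Random(seed)) for seed in seeds) / len(seeds)


for m in (1, 3, 5):
    print(m, determinized_value(noisy_outcome, range(m)))
```

With m = 1 a cell can only read as a full hit or a full miss; larger m moves the estimate toward the true expectation at a proportional cost in simulation time, which is the budget trade-off behind widening m.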

Expert iteration (ExIt) algorithm

Improve-then-imitate: use search as a temporary expert, record its choices, and train the raw network to reproduce them, so planning gets baked into instinct. Here: exit_gen.py → exit_train.py; one round is 150 self-play games ≈ 2,400 search-labelled decisions. Produced the mc6→wider lineage, and with it the Trick Room collapse.

Distillation algorithm

Training a student network to match a teacher's output distribution (cross-entropy), rather than learning from rewards. Here: the imitate half of ExIt — and the dangerous half: it faithfully copies the teacher's blind spots, grinding delayed-payoff moves toward zero (Protect 14.5% → 3.9%).

Reward anchoring algorithm

Protecting a class of decisions during distillation by pinning their target to a trusted reference instead of the teacher. Here: exit_gen --anchor pins utility-move mass to the RL policy that learned those moves correctly via PBRS; search still teaches everything else. This produced mc9, the first gate-passing ExIt checkpoint.

KL divergence metric

A measure of how far one probability distribution has drifted from another — the standard "distance" between two policies. Here: the kl-leash adds a KL-to-mc5 penalty during distillation, letting more epochs run without the student drifting away from mc5's strategy repertoire.
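A sketch of how a KL penalty can sit next to the distillation loss. The direction of the KL term (reference to student), the weight `beta`, and all three function names are assumptions; the glossary only says that a KL-to-mc5 penalty is added during distillation.

```python
import math


def cross_entropy(target, student, eps=1e-12):
    """Distillation loss: how poorly the student reproduces the target."""
    return -sum(t * math.log(max(s, eps)) for t, s in zip(target, student) if t > 0.0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; zero exactly when the two distributions match."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def leashed_loss(teacher, student, reference, beta=0.1):
    """Imitate the teacher while paying a price for drifting from the reference."""
    return cross_entropy(teacher, student) + beta * kl_divergence(reference, student)


teacher = [0.7, 0.2, 0.1]    # search-improved target (illustrative numbers)
reference = [0.4, 0.3, 0.3]  # stand-in for the reference policy's distribution
student = [0.6, 0.25, 0.15]
print(leashed_loss(teacher, student, reference))
```

Raising `beta` keeps the student closer to the reference at the cost of following the teacher less exactly, which is the trade the kl-leash entry describes.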

Joint action mechanic

In a multi-unit game, one decision that commits all your units at once. Here: doubles means each turn's action is a pair of slot choices, enumerated as the Cartesian product of both slots' options — which is why the action set runs to hundreds.
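A sketch of the Cartesian-product enumeration. The option strings, the helper names, and the truncation used to apply the cap are all hypothetical; the only rule filtered here is that both slots cannot switch to the same benched Pokémon. The cap of 400 is the figure given under "Action space".

```python
from itertools import product


def slot_options(moves, targets, bench):
    """All single-slot options: every move at every target, plus each switch."""
    return ([f"{move}->{target}" for move in moves for target in targets]
            + [f"switch:{mon}" for mon in bench])


def joint_actions(slot_a, slot_b, cap=400):
    """Pair every slot-A option with every slot-B option, minus illegal pairs."""
    pairs = [(a, b) for a, b in product(slot_a, slot_b)
             if not (a.startswith("switch:") and a == b)]
    return pairs[:cap]  # simple truncation; how the project applies its cap is not stated


moves = ["m1", "m2", "m3", "m4"]
targets = ["foe1", "foe2", "ally"]
bench = ["bench1", "bench2"]
options = slot_options(moves, targets, bench)  # 12 move options + 2 switches = 14
print(len(joint_actions(options, options)))    # 14 * 14 - 2 = 194
```

Even this small setup yields close to two hundred joint actions before mega variants are counted, which illustrates why the set runs to hundreds.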

The checkpoint lineage

mc5 data

The last pre-ExIt RL policy — pure PPO self-play with benefit-aware weather Φ. Here: it passes the full gate, plays every strategy it was taught, and became the anchor reference every later fix leans on. Its flaw: a near-flat distribution that sampled poorly.

mc6 data

The first ExIt round: search over mc5, distilled back in. Here: it sharpened mc5's flat ~2%-everywhere distribution into calibrated 22/18/16% peaks — worth +22.4 pp under sampled play, a gain greedy evaluation showed as +1.1 pp noise.

mc7 data

The second ExIt round and the strongest checkpoint of its era: 78.3% vs the mc2 field, promoted to default on win-rate. Here: the cautionary tale — it fails the gate, setting Trick Room in 0.8% of legal chances (never once on a dedicated TR team). Win-rate hid the lobotomy through two promotions.

mc8 data

The third ExIt round: same teacher recipe over mc7. Here: converged — cross-entropy flat, −0.6 pp vs mc7 — proof that re-distilling the same depth-1 teacher buys nothing once the student reproduces it. Never promoted.

mc9 data

Reward-anchored ExIt: utility-move targets pinned to mc5's prior, light 3-epoch distillation. Here: the first ExIt checkpoint to pass the gate — at a strength cost (61.1% vs mc7's 78.3% against the mc2 field). Promoted anyway, and served as the default until the wide-teacher line.

kl-leash data

Anchored distillation plus a KL-to-mc5 penalty, allowing more epochs safely. Here: the best strategy retention ever measured — Trick Room at 10.3% — but tied on strength, and strength is the goal. Archived, not deployed.

wide data

The wider-teacher experiment: exit_gen --k 10 --m 3 over mc5 — more candidate actions, more determinization seeds. Here: the first gate-passing checkpoint to beat the old default (+3.6 pp pilot-bench, Elo 260 vs 253), proving the ceiling was search budget, not depth.

wider data

The budget pushed further: exit_gen --k 12 --m 5 over mc5, same kl-leash distillation. Here: the deployed default — Elo 309, the first gate-passer to beat raw mc7 (301), pilot-bench 84.5% vs the mc2 field, Trick Room at 4.1% (right on the floor). Saved as policy_default.pt.