Chapter 2 · Part I — The setup

Borrowing a world

A learner needs somewhere to play millions of games. This chapter is the story of not building that somewhere — and of the plumbing, validation, and data work it took to trust a world someone else built.

After this chapter you can explain

Why the project wraps Pokémon Showdown instead of building a simulator — and how that trust was earned (stat and damage parity).
How Python drives a Node game engine over a pipe, and what the barrier message is for.
How 6 processes × 8 games and one batched inference call bought a ~12× speedup.
The 632 → 599 team funnel, and why nature-backfilled teams never touch the metrics.

The problem: a learner needs a world

Chapter 1 ended with a decision: the rules of good play would be learned, not written. But learning by trial and error has a brutal prerequisite — somewhere to run the trials. The learner will need millions of turns of play, which means a complete, faithful implementation of Regulation M-B: every move, every ability, every item interaction, damage formulas, speed order, Mega Evolution, Trick Room, the works. Building that engine from scratch would dwarf everything else in this project, and every bug in it would silently teach the bot wrong lessons.

Then came the finding that shaped the entire repo. Pokémon Showdown — the open-source battle simulator the competitive community has run on for a decade — already ships this exact format: mod champions, format gen9championsvgc2026regmb. Doubles, level 50, bring-6/pick-4, Open Team Sheets, Megas on, Tera hardcoded off. The hardest deliverable on the board was already built, battle-tested by thousands of daily games, and maintained by someone else.

Key point

The project's biggest single finding is a boring one: wrap, don't build. Showdown lives as an external built clone at ~/repos/pokemon-showdown, and the environment imports its dist/sim directly. There is no simulator code in this repo — only the wrapper around one.

For the curious

That import is also the repo's one remaining hardcoded absolute path — src/selfplay/bridge.mjs points at /Users/dmartins/repos/pokemon-showdown/.../dist/sim while every other path is repo-relative. It's the first thing that breaks on another machine, and it's tracked in docs/tech-debt.md under portability.

Trust, but verify

Adopting someone else's engine only works if it computes the same game as the real app. So before any learning code existed, the project ran a validation phase — think of it as reconciling two systems of record before switching one off.

Stat parity. M-B doesn't use classic Pokémon EV spreads; it uses "SP" spreads — 66 points total, at most 32 per stat, all IVs 31, natures giving ±10%. Showdown's champions mod implements this with its own stat math. The project proved — algebraically and then numerically — that writing the SP value directly into Showdown's EV field (plus a key remap and the nature's plus/minus) reproduces the app's exact stats. That identity is now load-bearing: loadTeam() in the bridge does exactly this for every team, every game.

Damage parity. Damage in this game is drawn from 16 discrete rolls. The project compared Showdown's 16 rolls against the app's own damage calculator on the core cases — neutral hits, STAB plus super-effective, resisted — and got 16/16 exact matches on each. (The fancier modifier matrix — weather, items, crits, spread reduction, Mega formes — was deferred as low-risk validation debt; it's on the list in docs/tech-debt.md.)

War story

The damage rolls initially didn't match, and for a while it looked like the whole wrap-don't-build plan was dead. The real cause: Showdown enumerates its rolls from max down to min, while the calculator's array runs min up to max. Compared index-for-index, two identical distributions look completely wrong. Compared as a sorted multiset — same values, order ignored — they matched 16/16. The lesson generalizes to any reconciliation job: when two systems disagree, audit the comparison before auditing the systems. (docs/caveats.md #4.)
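A minimal sketch of that comparison fix. The roll values are made up for illustration and are not taken from either engine. It shows an index-for-index check failing on two identical distributions listed in opposite order, while an order-insensitive check passes.

```python
from collections import Counter


def same_rolls(a, b):
    """Compare two roll lists as multisets: same values, same counts, order ignored."""
    return Counter(a) == Counter(b)


# Illustrative values only: 16 rolls listed min-to-max, and the same rolls max-to-min.
calc_rolls = [85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]
sim_rolls = list(reversed(calc_rolls))

# Naive check: how many positions agree when compared index-for-index?
index_matches = sum(x == y for x, y in zip(calc_rolls, sim_rolls))

# Order-insensitive check: are these the same 16 values?
multiset_match = same_rolls(calc_rolls, sim_rolls)
```

Counter is used rather than set so that repeated roll values, which are common in damage ranges, still have to match in count.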

The calc oracle: a second opinion both sides trust

The "app's own calculator" in that story is a real component of this repo: a headless damage/stat calculator lifted from the game app, living in engine/calc.ts over the raw engine and data in legacy-engine/. The project calls it the calc oracle — an independent source of ground truth you can query.

It has had two careers. First, validation: it was the reference Showdown was checked against. Second — and this is a lovely bit of reuse — it was later wired into the live environment: during training, every damaging move-action gets a "does this move KO the target?" feature computed by the oracle (memoized in a cache, because the same matchup recurs constantly). That single feature was worth +4 points of win-rate when it landed. Chapter 4 covers it properly.

In plain terms

The oracle is a shared lookup service. During migration it played the role of the reconciliation source — the table both the old and new pipeline must agree with. In production it became a feature-enrichment service on the hot path: a cached, deterministic "KO or not?" endpoint the environment calls per action. Same service, two jobs.
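A rough sketch of the memoization idea, not the project's code. The function names, the argument shape, and the "lowest roll still KOs" definition are all assumptions made for illustration; the text only says the feature is computed by the oracle and cached.

```python
from functools import lru_cache

oracle_calls = 0  # counts how often the expensive path actually runs


def damage_rolls(attacker, move, defender):
    """Hypothetical stand-in for an oracle query: returns possible damage rolls."""
    global oracle_calls
    oracle_calls += 1
    return (88, 90, 92, 94)  # toy values; a real query would return 16 rolls


@lru_cache(maxsize=None)
def guaranteed_ko(attacker, move, defender, defender_hp):
    """True when even the lowest roll removes the defender's remaining HP (assumed rule)."""
    return min(damage_rolls(attacker, move, defender)) >= defender_hp


first = guaranteed_ko("attacker-a", "move-x", "defender-b", 80)
again = guaranteed_ko("attacker-a", "move-x", "defender-b", 80)  # served from the cache
other = guaranteed_ko("attacker-a", "move-x", "defender-b", 90)  # new key, new oracle call
```

The cache only pays off because the query is deterministic: the same matchup always yields the same answer, so a repeated key never needs the oracle again.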

The bridge: Python drives a Node engine over a pipe

Showdown is JavaScript; the learning stack (PyTorch) is Python. The seam between them is the bridge: a Node process (src/selfplay/bridge.mjs) hosts the Showdown sim, and Python (src/selfplay/vecbridge.py) drives it by exchanging JSON-lines over stdio — one JSON object per line, requests down the child's stdin, responses up its stdout.

In plain terms

It's a microservice with a line-delimited JSON protocol — except the transport is a pipe instead of a socket, and deployment is spawn. Python sends action messages; Node replies with decision requests ("game 3 needs a choice; here are its legal actions and the current state") and terminal notices ("game 5 ended, p1 won"). Everything you know about designing service protocols applies, including the failure modes — as the war stories below prove.
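A minimal sketch of a JSON-lines exchange over stdio. To keep it runnable without Node, the child here is a small Python script standing in for the bridge; the message fields (game, action, echo) are invented for the example and are not the project's schema.

```python
import json
import subprocess
import sys

# Stand-in child process: answers each JSON line on stdin with one JSON line on stdout.
CHILD = r"""
import json, sys
for line in sys.stdin:
    msg = json.loads(line)
    reply = {"game": msg["game"], "echo": msg["action"]}
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()
"""


def round_trip(messages):
    """Send each message as one JSON line and read one JSON line back."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    replies = []
    try:
        for msg in messages:
            proc.stdin.write((json.dumps(msg) + "\n").encode())
            proc.stdin.flush()  # without a flush the line may sit in our buffer
            replies.append(json.loads(proc.stdout.readline()))
    finally:
        proc.stdin.close()  # EOF tells the child to finish
        proc.wait()
        proc.stdout.close()
    return replies
```

Calling round_trip([{"game": 3, "action": "move 1"}]) returns one reply per message. This strict one-request, one-reply rhythm is the simple case; once the child emits several lines per round, the reader needs an explicit end-of-round signal.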

The bridge runs in two modes, set at reset. In vs-random mode, Python controls player 1 and Showdown's built-in RandomPlayerAI plays player 2 — the sparring dummy and, per Chapter 1, the baseline. In self-play mode, Python controls both sides: each decision is tagged with which side it belongs to, and both sides' trajectories are collected for training.

Going wide: 48 games at once

One game at a time is far too slow to feed a learner. So the environment is vectorized in two layers. First, one Node process hosts 8 concurrent games — each message tagged with a game id. Second, a Python-side pool (VecPool, in vecbridge.py) runs 6 such processes, one per core, for 48 simultaneous games. Each round, Python gathers every pending decision across all 48 games and answers them with one batched neural-network call — the network scores all pending decisions in a single forward pass instead of 48 little ones.

The system map. Python's VecPool speaks JSON-lines over stdio to six Node bridge processes, each hosting eight concurrent Showdown games. The calc oracle feeds per-action KO features into the bridge as a side channel. One batched inference call per round answers all pending decisions across 48 games.
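The gather, score, scatter pattern can be sketched as follows. The scoring function is a toy stand-in for the network (it scores an action by the length of its string), and the data shapes are assumptions; only the idea of one call per round comes from the text.

```python
model_calls = 0  # how many "forward passes" were made


def score_batch(batch):
    """Toy stand-in for the network: one call scores every decision in the batch."""
    global model_calls
    model_calls += 1
    return [[len(action) for action in legal] for legal in batch]


def answer_round(pending):
    """pending maps game id -> list of legal actions; returns game id -> chosen action."""
    game_ids = list(pending)
    scores = score_batch([pending[g] for g in game_ids])  # one call for the whole round
    choices = {}
    for game_id, action_scores in zip(game_ids, scores):
        best = max(range(len(action_scores)), key=action_scores.__getitem__)
        choices[game_id] = pending[game_id][best]
    return choices


pending = {0: ["move 1", "switch 2"], 7: ["move 2"], 41: ["move 1", "move 2 mega"]}
choices = answer_round(pending)
```

Keeping the list of game ids alongside the batch is what lets each score row be routed back to the game that asked for it.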

The payoff: a training iteration dropped from ~14–18 seconds to ~1.2 seconds — roughly 12×. And not all of that came from parallelism. A chunk came from round-trip economy: whenever a decision has one or zero real options — a forced switch after a faint, a slot with a single legal move — the bridge auto-plays it internally and never sends it across the pipe. Only decisions with an actual choice reach Python, which cut the pipe traffic to about 20 decisions per game — the same ~20 you met in Chapter 1's timeline.

In plain terms

This is textbook batch-and-filter pipeline tuning. Batching: replace 48 chatty request/response cycles with one bulk call per round (the same reason you write to a warehouse in batches, not row-by-row). Filtering at the source: don't ship records that need no processing — a forced switch is a no-op decision, so resolve it at the edge instead of round-tripping it through the model.
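A small sketch of filtering at the source, written in Python for consistency with the other examples even though the real filter lives on the Node side. The action strings and the "pass" placeholder for the zero-option case are hypothetical.

```python
def route(legal_actions):
    """Decide where a decision is resolved.

    Returns ("auto", action) when there is nothing to choose, so the decision
    never crosses the pipe, or ("ask", None) when the model must pick.
    """
    if len(legal_actions) <= 1:
        forced = legal_actions[0] if legal_actions else "pass"  # "pass" is a placeholder
        return ("auto", forced)
    return ("ask", None)


# Hypothetical decisions from one game: only those with a real choice get sent.
decisions = [
    ["move 1", "move 2", "switch 3"],  # real choice
    ["switch 2"],                      # forced switch after a faint
    ["move 1"],                        # slot with a single legal move
    ["move 1", "move 3"],              # real choice
]
sent_to_python = [d for d in decisions if route(d)[0] == "ask"]
```

Here four decisions shrink to two pipe round-trips; the other two are resolved where they arise.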

The barrier: a watermark for a battle stream

Here's the sneaky-hard part of the protocol, and the most data-engineering-shaped idea in the repo. Python steps all 48 games in rounds: read every game's pending decision, answer all of them, repeat. But how does Python know it has read all of a round's messages? Showdown emits requests asynchronously, and the cadence varies — a faint triggers an extra forced-switch request, a terminal ends a game early. Counting messages is a trap: the expected count changes constantly.

So the bridge doesn't count. Once every active game in a process is quiescent — awaiting an action or finished — the bridge emits a single {barrier:true} line. Python simply reads until the barrier. The barrier is the protocol's statement of completeness: "no more messages this round."

One round of the barrier protocol. Games emit decisions and terminals in whatever order Showdown produces them; message counts vary as games faint and finish. Python never counts — it reads until {barrier:true}, then answers every pending decision in one batch.

In plain terms

The barrier is a watermark, exactly as in a streaming pipeline: you never infer window completeness by counting events (counts drift, sources stall); you wait for the explicit "no more events for this window" signal, emitted by the party that actually knows. The bridge knows when all its games are quiescent; Python doesn't and shouldn't guess.
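The read-until-barrier loop might look roughly like this. Only the {barrier:true} line is taken from the text; the other message fields are invented, and an in-memory stream stands in for the pipe.

```python
import io
import json


def read_round(stream):
    """Read JSON lines until the barrier; return the messages that preceded it."""
    messages = []
    for line in stream:
        msg = json.loads(line)
        if msg.get("barrier") is True:
            return messages  # the sender has declared the round complete
        messages.append(msg)
    raise EOFError("stream ended before a barrier arrived")


# Two rounds with different message counts; the reader never needs to know the counts.
stream = io.StringIO(
    '{"game": 3, "type": "decision"}\n'
    '{"game": 5, "type": "terminal"}\n'
    '{"game": 0, "type": "decision"}\n'
    '{"barrier": true}\n'
    '{"game": 3, "type": "decision"}\n'
    '{"barrier": true}\n'
)
first_round = read_round(stream)
second_round = read_round(stream)
```

The reader contains no expected count anywhere, which is the point: a round of three messages and a round of one are handled by the same loop.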

Even the barrier had its own bug: it originally fired from inside the action-processing loop, when games later in the batch still held their previous state — so quiescence looked satisfied early, the barrier fired, and decisions emitted after it were orphaned. Self-play would deadlock after ~25–50 rounds. The fix is a rule worth framing: never test a completeness condition synchronously mid-batch; check it only in async contexts, once, at the end of each command (docs/caveats.md #6).

Three outages

What follows are the environment's three best production incidents. Not one of them is a machine-learning bug. They are pipe, process, and event-loop bugs — the exact genus of failure a data engineer debugs for a living — and they are the reason this chapter keeps insisting the environment is infrastructure.

War story

The pipe that lied to select(). After vectorization, Python began timing out even though the bridge had visibly written its messages. Cause: the subprocess was opened with text=True, so its stdout was a buffered TextIOWrapper. When the bridge emitted a whole round at once, one readline() slurped several lines into Python's interpreter buffer — then the next select() watched the OS pipe, found it empty (the data was already inside the process), and timed out on data Python was holding in its own hand. The one-line-per-step protocol had never tripped this, because there was never more than one line in flight. Fix: open the pipe binary and unbuffered, own the line buffer yourself, and only select() when your buffer holds no complete line (vecbridge.py: VecBridge._recv_one; docs/caveats.md #5). A pure systems bug, found in the middle of an ML project.
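A sketch of the fix pattern on a POSIX system, using a bare os.pipe() in place of the subprocess pipe. The class name and message contents are made up for the example; this is not the code in VecBridge._recv_one. The key property is that select() is consulted only when the reader's own buffer holds no complete line.

```python
import os
import select


class LineReader:
    """Reads newline-terminated lines from a raw fd, keeping its own byte buffer."""

    def __init__(self, fd):
        self.fd = fd
        self.buf = b""

    def recv_line(self, timeout):
        # Look in our own buffer first; only ask the OS when no full line is held.
        while b"\n" not in self.buf:
            ready, _, _ = select.select([self.fd], [], [], timeout)
            if not ready:
                raise TimeoutError("no complete line within timeout")
            chunk = os.read(self.fd, 65536)  # unbuffered read straight from the fd
            if not chunk:
                raise EOFError("pipe closed")
            self.buf += chunk
        line, _, self.buf = self.buf.partition(b"\n")
        return line


r, w = os.pipe()
os.write(w, b'{"game": 1}\n{"game": 2}\n{"barrier": true}\n')  # a whole round at once
reader = LineReader(r)
lines = [reader.recv_line(timeout=1.0) for _ in range(3)]
```

The first call pulls all three lines off the pipe in one read; the next two calls return from the buffer without touching select(), which is exactly the case the buffered text wrapper got wrong.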

War story

The engine that hangs forever. A specific full-turn resolution in the upstream champions mod can infinite-loop Showdown's sim, permanently blocking that Node process's event loop. The diagnosis was tidy: a setInterval heartbeat in the process stops beating — and a JavaScript watchdog can't save you, because the watchdog runs on the same blocked event loop. The only cure is killing the process from outside. The original mitigation — a 10-second timeout that restarted the whole pool — cost a lost chunk of games one to two times per iteration. The hardened version isolates the blast radius: a 3-second timeout (safe, because the block is permanent — waiting longer buys nothing) restarts only the stuck process while the other five keep their work, and games aborted mid-episode are dropped from the batch so mislabeled outcomes never train the net. Root cause is upstream and still open; the mitigation just made it cheap.
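A simplified sketch of "restart only the stuck worker", assuming a POSIX system. The workers are small Python scripts standing in for Node bridges, and the function names are hypothetical. Polling is sequential here for brevity, so timeouts add up across stuck workers; a real pool would wait on all pipes at once.

```python
import select
import subprocess
import sys

HEALTHY = "print('ready', flush=True)"             # writes a line promptly
HUNG = "import time\nwhile True:\n    time.sleep(60)"  # never writes anything


def spawn(code):
    """Start a worker with an unbuffered binary stdout pipe."""
    return subprocess.Popen([sys.executable, "-c", code], stdout=subprocess.PIPE, bufsize=0)


def poll_pool(pool, respawn, timeout):
    """Wait up to `timeout` seconds per worker for output; replace only the silent ones."""
    restarted = []
    for i, proc in enumerate(pool):
        ready, _, _ = select.select([proc.stdout], [], [], timeout)
        if not ready:
            proc.kill()          # a blocked worker cannot rescue itself
            proc.wait()          # reap it so it does not linger as a zombie
            proc.stdout.close()
            pool[i] = respawn()  # the other workers are left untouched
            restarted.append(i)
    return restarted
```

With a pool of one hung worker between two healthy ones, poll_pool returns [1] and the pool keeps its size. Any games that were running in the killed worker would still need to be dropped from the batch by the caller.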

The third incident is smaller but instantly recognizable: killing the Python trainer used to leave its six Node bridges alive forever. They sat reading a dead stdin with no close handler, piling up as orphaned processes across restarts. The fix is two lines of hygiene: bridge.mjs now exits when stdin closes, and the pool wait()s on killed processes to reap them. The health check is one you'd write for any worker fleet: the number of PIDs listed by pgrep -f bridge.mjs should stay approximately equal to the pool size.
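Both halves of that hygiene can be sketched in a few lines. The worker below is a Python stand-in for the Node bridge, and the function names are hypothetical: the worker exits when its stdin reaches end-of-file, and the parent waits on it so the exit status is collected.

```python
import subprocess
import sys

# Stand-in worker: the loop ends at EOF, which happens when the parent
# closes the pipe or dies, so the worker can never outlive its parent for long.
WORKER = r"""
import sys
for line in sys.stdin:
    pass
sys.exit(0)
"""


def start_worker():
    return subprocess.Popen([sys.executable, "-c", WORKER], stdin=subprocess.PIPE)


def shutdown(proc, timeout=5.0):
    """Close the worker's stdin, then reap it; fall back to kill if it ignores EOF."""
    proc.stdin.close()  # worker sees EOF on stdin and exits on its own
    try:
        return proc.wait(timeout=timeout)  # collect the exit status (reaping)
    except subprocess.TimeoutExpired:
        proc.kill()
        return proc.wait()
```

Calling shutdown(start_worker()) returns exit code 0 here, since the worker leaves voluntarily once its input is gone.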

Key point

Every serious environment bug in this project was an infrastructure bug — buffering, blocked event loops, orphaned processes, premature completeness signals. The ML on top only works because the plumbing under it was debugged like production plumbing.

The data layer: 632 teams, 599 legal, two pools

A world also needs inhabitants. Training on one fixed team would teach the bot one matchup; the goal is a policy that generalizes across the metagame. So the project assembled a corpus of real tournament teams: 260 from VGCPastes and 372 scraped from op.gg — 632 total, stored in SQLite (data/vgcpastes-replicas.db, data/opgg-replicas.db), alongside data/champions.db holding the game data itself (315 Pokémon, moves, items, abilities, type chart). Each game, the bridge draws a different team for each side, so the learner faces the whole field.

Raw scraped data being raw scraped data, 33 of those 632 teams are illegal under the format — banned move combinations, a bogus megaraichuy, item-clause duplicates. The failure mode was vicious: an illegal team doesn't error politely at load; it throws synchronously inside the sim's start command and kills the whole Node process — taking its seven innocent games down with it. And the bridge's error backstop couldn't help, because it catches bad choices, not bad teams. The fix is the schema-validation reflex: run Showdown's TeamValidator over the corpus once at load and drop the rejects. 599 legal teams survive. Validate at the boundary, not at point of use.

The corpus funnel. 632 scraped teams; Showdown's TeamValidator drops 33 illegal ones at load (each of which used to crash a whole process at start); the 599 legal teams then split into a clean pool and a nature-backfilled, draw-only pool that training may face but measurement never uses.
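The validate-at-the-boundary step can be sketched like this. The validator below is a toy that only flags duplicate held items; it is a stand-in for illustration and does not reproduce Showdown's TeamValidator or its API. The team structure is likewise assumed.

```python
def item_clause_problems(team):
    """Toy validator: returns a list of problems, empty when the team passes."""
    items = [member["item"] for member in team["members"]]
    duplicates = sorted({item for item in items if items.count(item) > 1})
    return [f"duplicate item: {item}" for item in duplicates]


def load_legal_teams(teams, validate):
    """Run the validator once per team at load; keep only teams with no problems."""
    legal, rejected = [], []
    for team in teams:
        problems = validate(team)
        if problems:
            rejected.append((team["name"], problems))  # keep the reason for auditing
        else:
            legal.append(team)
    return legal, rejected


corpus = [
    {"name": "team-a", "members": [{"item": "Leftovers"}, {"item": "Focus Sash"}]},
    {"name": "team-b", "members": [{"item": "Focus Sash"}, {"item": "Focus Sash"}]},
]
legal, rejected = load_legal_teams(corpus, item_clause_problems)
```

Recording why each team was rejected costs little and makes the funnel auditable later, which matters when the count of dropped teams is itself a reported number.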

One data-quality problem remained, and it's the most data-engineering story in the repo. About 243 of the op.gg teams were scraped with no nature at all, so they were missing the ±10% stat modifier described in this chapter's stat-parity section. They silently defaulted to a neutral nature, mis-statting every affected Pokémon: wrong attack, wrong speed order, wrong damage, wrong KO features. The repair: backfill natures by copying from real sets of the same species, using an exact-spread twin if one exists, else the modal nature for that spread shape, else a sensible rule. Leave-one-out accuracy: ~84% where an exact-spread twin exists (near the ceiling, since identical spreads genuinely split between two natures in the real meta), ~73% otherwise.
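The three-step fallback can be sketched as below. Several pieces are assumptions, since the text does not define them: the stat order (HP, Atk, Def, SpA, SpD, Spe), the meaning of "spread shape" (here: which stats received any points), the last-resort rule, and the sample sets, which are invented for illustration.

```python
from collections import Counter


def shape(spread):
    """Hypothetical 'spread shape': which of the six stats received any points."""
    return tuple(points > 0 for points in spread)


def modal(natures):
    """Most common nature in a non-empty list."""
    return Counter(natures).most_common(1)[0][0]


def backfill_nature(species, spread, known_sets, rule):
    """Return (nature, source) using the exact-twin, modal-for-shape, rule chain.

    known_sets: iterable of (species, spread, nature) tuples with real natures.
    rule: last-resort callable mapping a spread to a nature.
    """
    same_species = [(s, n) for sp, s, n in known_sets if sp == species]
    twins = [n for s, n in same_species if s == spread]
    if twins:
        return modal(twins), "exact-twin"
    alike = [n for s, n in same_species if shape(s) == shape(spread)]
    if alike:
        return modal(alike), "modal-for-shape"
    return rule(spread), "rule"


# Invented sample data; each spread sums to 66 with at most 32 per stat.
known = [
    ("Garchomp", (2, 32, 0, 0, 0, 32), "Jolly"),
    ("Garchomp", (2, 32, 0, 0, 0, 32), "Jolly"),
    ("Garchomp", (2, 32, 0, 0, 0, 32), "Adamant"),
    ("Garchomp", (32, 32, 2, 0, 0, 0), "Adamant"),
]


def toy_rule(spread):
    """Placeholder rule: favor whichever attacking stat has more points."""
    return "Adamant" if spread[1] >= spread[3] else "Modest"
```

Returning the source alongside the nature makes it straightforward to report accuracy separately for each step of the chain, as the leave-one-out figures above do.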

And then the crucial discipline: backfilled teams are draw-only. They stay in the training pool as sparring partners — 220 realistic teams are too valuable to discard — but they are excluded from the clean pool that every measurement path draws from (the Elo ladder, the team benchmarks). An 84%-confident guess is fine to practice against; it is not fine inside a number you'll use to decide whether a new model is better.

In plain terms

This is imputation policy, stated the way a data team would state it: imputed rows may feed training, never the metrics table. Backfilling from same-species sets is imputing a missing categorical from cohort statistics; leave-one-out accuracy is the honesty check; and the clean/draw-only split is the wall between "data good enough to learn from" and "data clean enough to report on." (A team-level confidence gate was considered and rejected — the natures are missing wholesale, ~6 per team, so gating would keep 0 of the 243. All-or-nothing, so: blind backfill, quarantined from metrics.)
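The wall between the two pools can be sketched as a flag and two draw paths. The field name nature_backfilled and the function names are hypothetical; the text only states the policy.

```python
import random


def split_pools(teams):
    """Return (clean_pool, training_pool); backfilled teams only enter the latter."""
    clean = [team for team in teams if not team["nature_backfilled"]]
    return clean, list(teams)


def draw(pool, rng):
    """Pick one team at random from the given pool."""
    return rng.choice(pool)


teams = [
    {"name": "team-a", "nature_backfilled": False},
    {"name": "team-b", "nature_backfilled": True},
    {"name": "team-c", "nature_backfilled": False},
]
clean_pool, training_pool = split_pools(teams)

rng = random.Random(0)
measurement_draws = [draw(clean_pool, rng) for _ in range(50)]  # metric paths use this pool
```

Because measurement code is handed only clean_pool, it has no way to draw an imputed team by accident; the policy is enforced by what each path can see rather than by a check at draw time.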

bridge.mjs tool

The Node side: hosts the Showdown sim, 8 games per process, enumerates legal actions, auto-plays forced decisions, emits decisions and the round barrier as JSON lines.

src/selfplay/bridge.mjs

VecBridge / VecPool tool

The Python side: owns the pipes (binary, unbuffered), reads rounds up to the barrier, runs 6 bridge processes, and restarts just the stuck one on a timeout.

src/selfplay/vecbridge.py

calc oracle tool

Headless damage/stat calculator lifted from the app. Proved stat and damage parity; now answers "does this move KO?" live, per action, memoized.

engine/calc.ts

team corpus data

599 validator-legal real teams (of 632 scraped) in SQLite, archetype-tagged, split into a clean measurement pool and a nature-backfilled draw-only pool.

data/opgg-replicas.db · data/vgcpastes-replicas.db

Check yourself

Why does the bridge emit a barrier instead of letting Python count messages per round?

Because the expected message count isn't stable: Showdown emits requests asynchronously, faints add forced-switch requests, terminals remove games mid-round, and forced decisions are auto-played inside the bridge and never sent at all. Only the bridge knows when every active game is quiescent, so it emits one {barrier:true} — a watermark — and Python reads until it arrives.

Showdown's damage rolls "didn't match" the calculator at first. What was actually wrong?

Nothing in either engine. Showdown enumerates its 16 rolls max-to-min while the calculator's array is min-to-max, so an index-by-index comparison of two identical distributions fails. Compared as sorted multisets, they matched 16/16. Lesson: when two systems disagree, check the comparison logic before suspecting the systems.

Why are the ~220 nature-backfilled teams allowed in training but banned from measurement?

Their natures are imputed (copied from same-species sets, ~84% leave-one-out accuracy at best), so their stats may be subtly wrong. As sparring partners that's harmless — they're still realistic teams. But a measurement number built on possibly-wrong stats is a number you can't trust to compare checkpoints, so every metric path (Elo, benchmarks) draws only from the clean pool. Imputed data trains; it never reports.