Chapter 3 · Part II — The learner

Learning by playing

Nobody could write down the rules of good doubles play, and there was no dataset of expert games to copy. What's left is the third option: let the system play the game, tell it only whether it won, and let it figure out the rest. This chapter is reinforcement learning from zero, using nothing but this project's own parts.

After this chapter you can explain

Why hand-written rules and supervised learning were both dead ends here — with the ~22% number that proved it
What agent, environment, observation, action, reward, and episode mean in this exact repo
Why the policy outputs probabilities and samples from them instead of picking one "best" move
What the value head predicts and why it matters later, and what the 50% → 88% milestone actually proved

Two roads that dead-end

Before "learn it by playing" earns its place, the two obvious alternatives have to lose. They lost early, and one of them lost embarrassingly.

Road one: write the rules yourself. Build a heuristic — a hand-written decision rule — like "always click the move that deals the most damage." It sounds reasonable. The project tried exactly this, and the max-damage heuristic won about 22% of its games against a player clicking random buttons. Not 50%. Twenty-two. A carefully written rule lost, badly, to noise.

War story

The 22% result (recorded in docs/caveats.md as "not a bug, a finding") is structural, not a coding mistake. Greedy attacking ignores everything that actually wins VGC doubles: Protect timing, Fake Out tempo, speed control like Tailwind and Trick Room, redirection, and focus-firing one target so it actually faints. A max-damage bot pours damage into whoever soaks it best while the opponent — even a random one — occasionally stumbles into the plays that matter. The finding set the project's baseline: random is the bar, and if a human-authored rule can't beat random, nobody is going to hand-write their way to strong play.

Road two: copy the experts. This is supervised learning — show a model millions of examples of "in this position, the expert did X" and train it to imitate. It's how you'd train a spam filter or an image classifier, and it works when a big labeled dataset exists. Here it doesn't. Regulation M-B is a niche format; there is no archive of millions of expert games annotated with their decisions. No dataset, no imitation.

What's left is the third road: don't tell the system what good play looks like at all. Let it play, tell it only whether it won or lost, and let it work backwards from that. This is reinforcement learning (RL): learning behavior from trial, error, and a score.

The frame: six words, one project

RL textbooks open with an abstract diagram of an "agent" interacting with an "environment." Every word in that diagram maps to a concrete file in this repo, so let's define the vocabulary using nothing but the project itself.

Agent mechanic

The thing making decisions: here, the policy network — a small neural net that reads the battle and picks a button. It starts knowing nothing.

src/selfplay/trainer.py

Environment mechanic

The world the agent acts in: Pokémon Showdown, wrapped by the Node⇄Python bridge from Chapter 2. It takes an action and returns what happened.

src/selfplay/bridge.mjs

Observation data

What the agent sees each turn — the battle flattened into numbers (an observation). Chapter 4 is entirely about what goes into it.

src/selfplay/featurize.py

Action mechanic

One legal joint choice for both of your active Pokémon — "slot 1 uses Fake Out on foe A, slot 2 Protects." The bridge enumerates the full legal action space each turn.

src/selfplay/bridge.mjs

Reward metric

The score. Here it is brutally simple: +1 for winning the game, −1 for losing, 0 for every other moment. That's the entire teaching signal: no partial credit, no hints.

src/selfplay/trainer.py

Episode data

One complete game, start to finish: an episode. In this environment, a game reaches the agent as roughly 20 decisions (forced choices are auto-played inside the bridge and never surface).

src/selfplay/vecbridge.py

Put together, one loop of the system reads: the environment sends an observation; the agent picks an action; Showdown resolves the turn; a new observation comes back; repeat ~20 times; then, and only then, a single +1 or −1 arrives.

The whole learning problem in one picture. Numbers in, one button-press out, ~20 times per game — and the only feedback is a single +1 or −1 when the game ends. Everything in Chapters 4–7 exists to make learning possible under exactly these conditions.
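The loop above can be sketched in a few lines of Python. This is a toy stand-in, not the project's bridge: ToyEnv, random_agent, and play_episode are hypothetical names, the observation is four random floats, and the win/loss coin flip replaces an actual battle. The only things taken from the text are the shape of the loop, the ~20 decisions per episode, and the reward schedule (0 on every step, ±1 at the end).

```python
import random


class ToyEnv:
    """Hypothetical stand-in for the Showdown bridge: a fixed-length episode."""

    def __init__(self, n_decisions=20, seed=0):
        self.n_decisions = n_decisions
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self._observe()

    def _observe(self):
        # A real observation is the featurized battle; here it is 4 random floats.
        return [self.rng.random() for _ in range(4)]

    def legal_actions(self):
        # A real env enumerates legal joint actions each turn; here there are 3.
        return [0, 1, 2]

    def step(self, action):
        self.t += 1
        done = self.t >= self.n_decisions
        # Reward is 0 on every step except the last, where it is +1 or -1.
        reward = 0.0
        if done:
            reward = 1.0 if self.rng.random() < 0.5 else -1.0
        return self._observe(), reward, done


def random_agent(observation, legal, rng):
    """An untrained agent: ignores the observation and clicks a random button."""
    return rng.choice(legal)


def play_episode(env, agent, rng):
    """Run one episode and return its (observation, action, reward) steps."""
    obs = env.reset()
    trajectory = []
    done = False
    while not done:
        action = agent(obs, env.legal_actions(), rng)
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
    return trajectory


trajectory = play_episode(ToyEnv(), random_agent, random.Random(1))
rewards = [reward for _, _, reward in trajectory]
print(len(trajectory), "decisions; final reward:", rewards[-1])
```

Every entry in rewards except the last is zero, which is the whole difficulty in miniature: twenty decisions, one signal.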

Key point

The reward really is just win/lose. Nobody rewards "good" moves, dealing damage, or keeping Pokémon alive. Any behavior the trained policy shows — Protecting at the right time, Mega-evolving turn 1 — was discovered purely because it correlated with that terminal ±1.

A policy is a probability table, not an answer

The word policy deserves a precise definition, because everything downstream hangs on it. A policy is a function: observation in, a probability for every legal action out. Not "the best move." A full probability distribution — this turn, 34% on Fake Out + Protect, 22% on Fake Out + Rock Slide, and so on down the legal list, summing to 100%.

To act, the agent doesn't take the top entry. It samples — it rolls a weighted die over the table. That sounds like a flaw. It's the design.

Reason one: exploration. You cannot learn the value of a move you never play. If the untrained net happens to slightly prefer attacking over Protecting, a "take the max" agent would literally never Protect, and so would never collect the evidence that Protect wins games. Sampling guarantees every plausible action keeps getting tried, so the win/loss statistics keep flowing for all of them.

Reason two: unpredictability. A deterministic player is an exploitable player — in a game of simultaneous moves, an opponent who knows exactly what you'll click can prepare the perfect counter every turn. This isn't hypothetical in this project: Chapter 6 tells the story of a training run that collapsed from 80% to 7% precisely because a deterministic opponent turned the game predictable, and Chapter 8 returns to sampled-vs-greedy as an evaluation question.

A policy is a weighted menu, not a verdict. Illustrative probabilities for one turn's legal joint actions. The agent rolls a die over the blue bars; illegal actions (out of PP, fainted target) are removed before the probabilities are computed, so the net cannot even express a preference for them — that's the masked softmax, coming in Chapter 5.
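As a rough illustration of sampling versus taking the maximum, here is a minimal sketch in plain Python. The logits and the legality mask are made-up numbers, and masked_softmax, sample_action, and greedy_action are hypothetical helper names, not the project's functions; the real masked softmax is Chapter 5's subject. The sketch only shows the two properties described above: illegal actions get exactly zero probability, and sampling keeps visiting every legal action while the greedy rule visits one.

```python
import math
import random


def masked_softmax(logits, legal_mask):
    """Turn raw scores into probabilities, giving illegal actions exactly 0."""
    legal_logits = [z for z, ok in zip(logits, legal_mask) if ok]
    if not legal_logits:
        raise ValueError("at least one action must be legal")
    peak = max(legal_logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) if ok else 0.0 for z, ok in zip(logits, legal_mask)]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Roll the weighted die: pick an index in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def greedy_action(probs):
    """The 'take the max' alternative: always the single most likely index."""
    return max(range(len(probs)), key=probs.__getitem__)


# Made-up scores for four joint actions; the third is illegal this turn.
logits = [1.2, 0.8, 5.0, 0.1]
legal_mask = [True, True, False, True]
probs = masked_softmax(logits, legal_mask)

rng = random.Random(0)
counts = [0] * len(probs)
for _ in range(10_000):
    counts[sample_action(probs, rng)] += 1

print("probabilities:", [round(p, 3) for p in probs])
print("sample counts:", counts)
print("greedy choice:", greedy_action(probs))
```

Note that the illegal action has the highest raw score and still is never played: the mask removes it before the probabilities exist, so there is nothing to sample.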

The learning rule, in one honest paragraph

So how does a random-clicking net become a player? The core idea — called policy gradient — needs no math to state. Play a batch of games. For every decision in a game that ended in a win, nudge the policy's probability for the action it took up a little. For every decision in a lost game, nudge it down a little. Repeat tens of thousands of times. Actions that keep showing up in wins accumulate probability; actions that keep showing up in losses bleed it away.

Every hard part of Part II is hiding inside two words in that paragraph. "Nudge" — how big a step, and how do you avoid one bad batch wrecking a good policy? That's PPO, the update algorithm, and it's Chapter 6. "Every decision" — the win arrived 20 decisions after your brilliant turn-2 Fake Out; which of the 20 actually deserves the credit? That's the credit-assignment problem, and it's Chapter 7. For now, hold the naive version: wins push probabilities up, losses push them down, and stability is the engineering.
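The naive version can be written down directly. The sketch below is a plain REINFORCE-style update on a single decision with three actions, not PPO and not the project's trainer: the win chances in WIN_CHANCE are invented, and softmax and reinforce_update are hypothetical helper names. Each step samples an action, observes a win (+1) or loss (−1), and moves the logits along outcome × (one-hot(action) − probabilities), which is the gradient of the log-probability of the action taken, scaled by the result.

```python
import math
import random


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(logits, action, outcome, lr):
    """Nudge the taken action up after a win (+1), down after a loss (-1)."""
    probs = softmax(logits)
    return [
        z + lr * outcome * ((1.0 if i == action else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]


# Invented numbers: how often each action leads to a win in this toy game.
WIN_CHANCE = [0.3, 0.5, 0.8]

rng = random.Random(0)
logits = [0.0, 0.0, 0.0]  # the untrained policy is uniform
for _ in range(5000):
    probs = softmax(logits)
    action = rng.choices(range(3), weights=probs, k=1)[0]
    outcome = 1.0 if rng.random() < WIN_CHANCE[action] else -1.0
    logits = reinforce_update(logits, action, outcome, lr=0.05)

final_probs = softmax(logits)
print("final policy:", [round(p, 3) for p in final_probs])
```

The fixed step size lr is exactly the unanswered "how big a nudge" question, and with one decision per game the credit-assignment problem never appears. Both simplifications are deliberate.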

In plain terms

The whole loop is closed-loop tuning, like a database that experiments on its own query plans: try a plan variant (sample an action), run the workload (play the game), measure latency (win or lose), and shift future traffic toward the variants that measured well. No human ever writes a plan. The optimizer's only input is the measurement — and, like a query tuner, it will happily optimize whatever you measure, which becomes the plot of Chapter 10.

The second output: a built-in win-probability estimator

The network actually produces two things from every observation. The first is the policy — the probability table above. The second is a single number: from this position, how likely am I to win? This output is called the value function, or the critic, and in this codebase it lives as the value head in src/selfplay/trainer.py.

It's trained by simple hindsight: after each game, every position in it gets labeled with how the game ended, and the value head is nudged toward predicting that. Over many games it becomes a calibrated win-probability estimator — the system's own sense of "am I ahead?"
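Hindsight labeling is simple enough to sketch. The version below assumes a lookup table over three invented position labels instead of a neural network reading observations, so it illustrates the training target only, not the project's value head. TRUE_WIN_CHANCE, update_values, and win_probability are hypothetical names, and mapping a value in [−1, +1] to a probability with (v + 1) / 2 is an assumption that follows from the ±1 reward.

```python
import random


def update_values(values, episode_states, outcome, lr=0.01):
    """Label every position in the game with its final outcome and nudge toward it."""
    for state in episode_states:
        old = values.get(state, 0.0)
        # One gradient step on the squared error 0.5 * (outcome - value)^2.
        values[state] = old + lr * (outcome - old)


def win_probability(value):
    """Map an expected outcome in [-1, +1] to a win probability in [0, 1]."""
    return (value + 1.0) / 2.0


# Invented ground truth: how often each kind of position goes on to win.
TRUE_WIN_CHANCE = {"ahead": 0.75, "even": 0.5, "behind": 0.25}

rng = random.Random(0)
values = {}
for _ in range(4000):
    state = rng.choice(list(TRUE_WIN_CHANCE))
    outcome = 1.0 if rng.random() < TRUE_WIN_CHANCE[state] else -1.0
    update_values(values, [state], outcome)

for state in TRUE_WIN_CHANCE:
    print(state, round(win_probability(values[state]), 2))
```

No single label is accurate (each is a hard +1 or −1), but the average of many labels for similar positions is, and that average is what the estimator converges toward.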

Plant this one firmly, because it quietly becomes load-bearing twice. In Chapter 7, the value head is what lets the trainer say "this move raised our win probability from 60% to 75%, credit it now" instead of waiting for the terminal ±1 (that's GAE). In Chapter 9, it becomes the evaluator that scores hypothetical futures for depth-1 search. Same head, three jobs: estimating win probability, assigning credit, and scoring search.

Exploration, exploitation, and a number called entropy

Every learning agent lives on a dial between two poles. Exploit: play what currently looks best, and win now. Explore: play something less-fancied, and maybe learn it was better all along. Lean too far toward exploring and you never get good; too far toward exploiting and you freeze early on whatever you stumbled into first.

The project tracks this dial with one measurable number: entropy — simply, how spread out the policy's probabilities are. A policy putting 25% on each of four actions has high entropy (still open-minded); a policy putting 99% on one action has near-zero entropy (mind made up). Collapsing entropy is dangerous for exactly the two reasons sampling exists: a near-zero-entropy policy has stopped exploring, so it can't recover from a wrong conviction, and it has become predictable, so it can be exploited. When Chapter 6's training collapse happened, cratering entropy was the smoking gun on the dashboard.
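Entropy here is the standard Shannon entropy of the action distribution. The sketch below computes it in nats (natural log) for the two cases just described; the unit, the function name entropy, and the exact split of the remaining 1% across three actions are assumptions for illustration, since the text does not say how the project logs it.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability actions contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


open_minded = [0.25, 0.25, 0.25, 0.25]     # 25% on each of four actions
mind_made_up = [0.99, 0.005, 0.003, 0.002]  # 99% on one action

print("uniform policy:  ", round(entropy(open_minded), 3))
print("collapsed policy:", round(entropy(mind_made_up), 3))
```

For four actions the maximum possible value is ln 4 ≈ 1.386, reached only by the uniform policy. A curve sliding from near that ceiling toward zero is what "cratering entropy" looks like on a dashboard.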

The first milestone: 50% to 88%

Here's how the loop's very first end-to-end test went. To keep the measurement clean, both sides played the same fixed team (a mirror match, so no team-quality luck), and the opponent was Showdown's built-in random player. That makes the baseline crisp: an untrained net is itself essentially a random clicker, and random-vs-random sits at about 50%.

Then PPO ran. After training, the policy was evaluated over 200 games and won 88% of them against random. That number, still quoted in the README, was the project's proof of life: the bridge delivers observations correctly, the featurizer isn't scrambling them, the action encoding is legal, the reward reaches the right games, and the nudging genuinely nudges. Any one of those broken, and the curve stays flat at 50%.

Proof of life, not proof of strength. The endpoints are the project's real numbers (untrained ≈ random ≈ 50%; trained 88% over a 200-game evaluation on the fixed mirror); the curve's shape is schematic. This graph says "the pipeline works end to end" — it does not yet say "the bot is good."

And now the honest caveat, because this project learned it the long way: vs-random is a weak yardstick, and a saturating one. Random never punishes your mistakes, so once you're winning ~90% of the time, real improvements barely move the number — and real regressions can hide under it entirely. The README records the moment this bit: a later policy hit 93% vs-random while still Earthquaking its own ally and skipping Fake Out, mistakes vs-random simply cannot see. The measurement story — Elo ladders, paired benchmarks, and playing against past versions of yourself via self-play — continues in Chapters 6 and 8.

For the curious

Why is the baseline "≈ 50%" and not exactly 50%? Two random players on identical mirror teams are symmetric, but games can also time out (the env cuts games at 50 turns and scores them 0), and per-run variance over a few hundred games is a few points either way. The project treats vs-random as an anchor with error bars, never a precision instrument — which is exactly why 86% vs 88% arguments eventually gave way to better metrics.
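The "few points either way" can be made concrete with the textbook binomial standard error. The sketch below treats each game as an independent win/non-win trial and uses the normal-approximation (Wald) interval; both are simplifying assumptions, and win_rate_interval is a hypothetical helper, not something from the repo. The input is the 88% over 200 games quoted above, i.e. 176 wins.

```python
import math


def win_rate_interval(wins, games, z=1.96):
    """Win rate with an approximate 95% interval (normal approximation)."""
    rate = wins / games
    stderr = math.sqrt(rate * (1.0 - rate) / games)
    low = max(0.0, rate - z * stderr)
    high = min(1.0, rate + z * stderr)
    return rate, low, high


rate, low, high = win_rate_interval(wins=176, games=200)
print(f"win rate {rate:.2f}, roughly {low:.3f} to {high:.3f}")
```

With 200 games the standard error is a little over two percentage points, so the interval around 88% comfortably contains 86%. Under these assumptions, the two results are not distinguishable at this sample size.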

Key point

Reinforcement learning here means: probabilities over legal actions, sampled; one ±1 per ~20-decision episode; wins nudge probabilities up, losses nudge them down; a value head learns "how likely am I to win from here" on the side. The 50% → 88% run proved this loop works. Everything after it is about making the learner see better (Ch. 4), decide better (Ch. 5), train more stably (Ch. 6–7), and be measured honestly (Ch. 8+).

Check yourself

Why couldn't the project just hand-write a good policy, or train one from expert games?

Hand-written rules failed empirically: the max-damage heuristic won only ~22% vs a random player, because greedy attacking ignores Protect, Fake Out, speed control, and focus fire — the knowledge is too implicit for anyone to write down. Supervised learning failed for lack of data: there's no large labeled dataset of expert Regulation M-B games to imitate. That leaves learning from the game itself: reinforcement learning.

The policy prefers action A at 34%. Why does the agent sometimes deliberately play the 7% action instead?

Because it samples from the probability table rather than taking the maximum. Sampling buys two things: exploration (you can't learn an action's value if you never play it, so every plausible action keeps generating win/loss evidence) and unpredictability (a deterministic player can be perfectly countered in a simultaneous-move game — a lesson the project relearned painfully during the Chapter 6 collapse).

What does the value head output, and why should you remember it now rather than in Chapter 7?

It outputs a single number per observation: the estimated probability of winning from this position, trained by labeling every position with how its game actually ended. It matters early because it later becomes load-bearing twice — GAE uses it to hand out per-move credit long before the game ends (Chapter 7), and depth-1 search uses it to evaluate hypothetical positions one turn ahead (Chapter 9).