Chapter 6 · Part II — The learner

Training day

The network from Chapter 5 starts with random weights. This chapter is about the loop that makes them good — PPO, self-play against a shelf of frozen past selves — and about the run where win-rate climbed to 92%, cratered to 40%, and every dashboard said everything was fine.

After this chapter you can explain

What PPO actually does, in one sentence, and why the "proximal" part is the whole point
Why the self-play opponent is a frozen league of snapshots — never the live learner
The late win-rate collapse: the curve, the diagnosis (a deterministic opponent), and the three-part fix
Why the KL guard read "all calm" through the entire collapse, and what that teaches about metrics

The update problem

Picture the moment after data collection. You just played 48 games with the current policy. Some of them ended in wins, some in losses, and every decision along the way got recorded: the position, the menu, the chosen action, its probability at the time. Now you have to answer one question: how hard do you yank the probabilities?

The tempting answer is "hard" — this action preceded a win, crank it way up. The problem is that all 48 games were played by the old policy. The recorded probabilities, the visited positions, the whole dataset describes how the old policy behaves. If one update jumps the network far from where it was, the data that justified the jump no longer describes the network you now have. You've deployed a big-bang rewrite based on evidence gathered from the previous release, and your next batch of games gets played by a stranger.

Yank too gently and nothing improves. Yank too hard and you destroy what worked. Every deep RL algorithm is a stance on this dial.

PPO: nudge, but stay close

PPO — Proximal Policy Optimization — is the stance this project takes, in ppo_update in src/selfplay/trainer.py. Plainly: for each recorded decision, nudge the chosen action's probability up (if it preceded good outcomes) or down (if bad), but clip the size of any single update so the new policy stays near the policy that gathered the data. That's the "proximal" — nearby. Concretely, PPO looks at the ratio of new probability to old probability for each action, and if a nudge would push that ratio past 1 ± 0.2 (the clip=0.2 in the code), it stops paying gradient for going further. There is no reward for lurching.
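To make the ratio-and-clip rule concrete, here is a minimal sketch of the per-decision objective in plain Python. It is not the project's ppo_update: the function name and the advantage argument (a signed score for how much better or worse than expected the outcome was) are assumptions made for illustration. Only clip=0.2 comes from the text above.

```python
import math


def clipped_surrogate(new_logp, old_logp, advantage, clip=0.2):
    """Per-decision PPO objective (higher is better). Illustrative only."""
    # Ratio of the new policy's probability to the recorded old probability.
    ratio = math.exp(new_logp - old_logp)
    # The same ratio, forced into [1 - clip, 1 + clip].
    clipped = max(1.0 - clip, min(1.0 + clip, ratio))
    # Keep the more pessimistic of the two terms, so pushing the ratio past
    # the clip edge in the rewarded direction earns no extra objective.
    return min(ratio * advantage, clipped * advantage)


old_logp = math.log(0.30)  # probability recorded when the game was played
for new_p in (0.30, 0.33, 0.36, 0.45):
    value = clipped_surrogate(math.log(new_p), old_logp, advantage=1.0)
    print(f"new prob {new_p:.2f} -> objective {value:.3f}")
```

With a positive advantage the objective rises until the ratio reaches 1.2 and is flat after that, so the gradient for pushing further is zero. With a negative advantage the same flattening happens once the ratio drops below 0.8.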

In plain terms

PPO is a canary deploy policy for weights. You never big-bang the whole system on one batch of evidence; you ship many small, individually reversible changes, each validated against traffic that still resembles the traffic it was tested on. The clip is the deploy gate: "no single release gets credit for moving any action's probability more than 20% from where the data found it." Boring by design. The alternative, taking huge steps on stale evidence, is the failure the clip exists to prevent. The collapse below is a different failure: it happened while every measured step stayed small.

Two more knobs live in the same function. The update runs several passes (4 epochs, minibatches of 256) over the batch, squeezing more learning from expensive-to-collect games. And an entropy bonus (coefficient 0.01) pays the policy a small reward for staying uncertain — entropy is a measure of how spread-out a probability distribution is, and keeping some spread means the policy keeps exploring instead of freezing onto one answer. Remember that word; it's the needle that moves in the war story.
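A small sketch of what "spread-out" means numerically, assuming the standard Shannon entropy in nats. The helper names are hypothetical; the 0.01 coefficient is the one quoted above, and how the real ppo_update folds the bonus into its loss is not shown here.

```python
import math


def entropy(probs):
    """Shannon entropy in nats: 0 for a one-hot distribution, log(n) for uniform."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_bonus(probs, coef=0.01):
    """Small reward added to the objective for keeping the distribution spread out."""
    return coef * entropy(probs)


spread = [0.25, 0.25, 0.25, 0.25]   # still exploring
peaked = [0.97, 0.01, 0.01, 0.01]   # nearly frozen onto one action
print(entropy(spread), entropy_bonus(spread))
print(entropy(peaked), entropy_bonus(peaked))
```

The uniform menu scores log(4), about 1.39 nats, while the peaked one scores under 0.2. A policy whose entropy keeps sliding toward zero is a policy that has stopped exploring.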

One training day, concretely

The loop in src/selfplay/selfplay_train.py is three verbs, repeated ~120 times:

Collect. Play 48 games in parallel through the VecPool (6 Node processes × 8 games each), the learner driving p1 against an opponent we'll meet in a second. Every learner decision is recorded as a transition. In self-play collection generally, both sides' decisions can be recorded — two trajectories per game for the price of one simulation.
Update. One PPO update over the batch: 4 clipped epochs, plus the KL guard we'll get to.
Eval. Every second iteration, a quick 60-game vs-random read (roughly ±6 percentage points of noise at that sample size, so a smoke test, not a verdict), and when it looks strong, a slow 200-game confirm.
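The noise figure on the 60-game eval can be sanity-checked with the binomial standard error. This is a back-of-envelope sketch that assumes games are independent and uses the worst case of a 50% true win-rate; the function name is hypothetical and the chapter does not say how its ±6 was derived.

```python
import math


def win_rate_std_error(win_rate, n_games):
    """One standard error of a win-rate estimate, assuming independent games."""
    return math.sqrt(win_rate * (1.0 - win_rate) / n_games)


for n_games in (60, 200):
    se = win_rate_std_error(0.5, n_games)
    print(f"{n_games} games: about +/-{100 * se:.1f} percentage points")
```

Under those assumptions the 60-game read carries about 6.5 points of noise and the 200-game confirm about 3.5, which is why only the confirm is allowed to promote a checkpoint.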

By the start of the vectorized era each collect-update cycle took about 1.2 seconds, down from 14–18 seconds before vectorization. That roughly 12–15× speedup is what made everything in this chapter observable at all: the collapse only surfaced once long runs became cheap enough to actually run.

The opponent: a frozen league of past selves

Who does the learner play against? The obvious answer — itself, live — is a trap. If both players are the same network updating every iteration, the environment itself shifts under the learner every step. You're optimizing against a moving target that moves because you optimized. Learning becomes chasing your own tail: beat strategy A, opponent morphs, strategy A stops being tested, forget it, opponent rediscovers it, lose to it. Rock–paper–scissors forever.

So the opponent is a frozen league: a shelf of past snapshots of the learner. The run starts with one snapshot (the untrained net, gen_000) and every 6 iterations adds a frozen copy of the current learner. Each collection batch samples one league member at random to be p2. The opponent is stable within a batch (learnable signal), varied across batches (no overfitting to a single opponent), and historically deep (old strategies stay in the test set, so they can't be silently forgotten).
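The shelf mechanic fits in a few lines. The sketch below is not the code in selfplay_train.py: the class, its method names, the uniform sampling, and the gen_001, gen_002 naming for later snapshots are all assumptions. Only gen_000 and the every-6-iterations cadence come from the text.

```python
import copy
import random


class FrozenLeague:
    """A shelf of frozen policy snapshots to sample opponents from (sketch)."""

    def __init__(self, initial_policy, snapshot_every=6, seed=0):
        # gen_000 is the untrained network, copied so later training can't touch it.
        self.members = [("gen_000", copy.deepcopy(initial_policy))]
        self.snapshot_every = snapshot_every
        self.rng = random.Random(seed)

    def maybe_snapshot(self, iteration, learner):
        """Every `snapshot_every` iterations, add a frozen copy of the learner."""
        if iteration > 0 and iteration % self.snapshot_every == 0:
            name = f"gen_{len(self.members):03d}"  # naming scheme is assumed
            self.members.append((name, copy.deepcopy(learner)))

    def sample_opponent(self):
        """Pick one league member uniformly at random to play p2 for a batch."""
        return self.rng.choice(self.members)


learner = {"w": 0.0}              # stand-in for real network weights
league = FrozenLeague(learner)
for iteration in range(1, 25):
    learner["w"] += 0.1           # pretend one PPO update happened
    league.maybe_snapshot(iteration, learner)
name, opponent = league.sample_opponent()
print(len(league.members), name)
```

The deep copy is the important part: a snapshot that shares weights with the live learner is not frozen, and the league quietly turns back into playing against yourself.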

On top of that sits best-by-confirm checkpointing: the learner is only saved as the best checkpoint when a long 200-game confirm run beats the previous best's confirmed score. A hot streak on a noisy 60-game eval saves nothing. The last known-good build is never overwritten by a spot check.
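As a sketch, the promotion rule is a two-condition gate. The function and parameter names are hypothetical, and reading "beats" as a strict greater-than is an assumption about the real check in selfplay_train.py.

```python
def should_promote(confirm_wins, confirm_games, best_confirmed_rate, min_games=200):
    """Decide whether the current learner replaces the saved best checkpoint."""
    # A short eval is never enough evidence, however good it looks.
    if confirm_games < min_games:
        return False
    # Promote only if the long confirm beats the previous confirmed score.
    return confirm_wins / confirm_games > best_confirmed_rate


print(should_promote(55, 60, 0.88))     # hot 60-game spot check
print(should_promote(174, 200, 0.88))   # 87% confirm, just under the bar
print(should_promote(180, 200, 0.88))   # 90% confirm
```

Only the last call returns True. The first fails on sample size, the second on score, and in both cases the saved checkpoint is left alone.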

In plain terms

The league is integration-testing against pinned versions. You don't test today's build only against today's build — you keep a shelf of released versions and require the new one to hold up against all of them. And best-by-confirm is artifact promotion: a build gets tagged latest-good only after the full regression suite passes, never off a green smoke test. Both of these exist in this repo because of what you're about to read.

The learner never fights its live self. Opponents come off a shelf of frozen past snapshots — stable within a batch, varied across batches, and never forgetting old strategies. The green gate at the bottom is why a later collapse can no longer destroy a good build.

The war story: the late win-rate collapse

Here is the chapter's heart, straight from docs/caveats.md. It happened twice — once as a mystery, once as a reproduction that revealed the cause.

War story

Act one: the mystery. The original vs-random trainer (trainer.py, lr 3e-4, no decay) had always early-stopped at its first milestone. Once vectorization made long runs cheap, the full curve appeared, and it was ugly. Untrained: 69% (a greedy net's consistency already exploits a random opponent). Iteration 4: 85%. Iteration 6: 92%, the peak. Then it bounced between 78% and 91% for a couple dozen iterations, and then iteration 32: 64%. Iteration 34: 42%. Iterations 36–38: 40%, worse than a coin flip, before limping back to around 53% and staying there. The policy learned fast, peaked, and then destroyed itself. One mercy: the 200-game confirms at the peak came in at 86–87%, just under the 88% save bar, so no checkpoint was ever overwritten. The old 88% build survived by luck and a strict gate.

Act two: the reproduction that solved it. When self-play landed, the collapse was reproduced deliberately — and it came back worse. With the rollout opponent playing greedy (deterministic — always its single highest-probability move), self-play went 80% → 7%, below the untrained network. That extremity was the clue. A deterministic opponent makes the entire environment deterministic: same position, same response, every time. PPO, doing exactly its job, finds one narrow line of play that exploits that fixed script perfectly. Entropy craters — the policy stops being a distribution and becomes a memorized sequence. And a memorized sequence against one script is useless against literally anyone who varies.

The fix that mattered: make the rollout opponent sample its moves instead of playing greedy (opp_greedy=False in collect_selfplay — the code comment reads "a deterministic one invites brittle exploits"). Supporting cast: learning rate 1e-4 with cosine decay, and best-by-confirm checkpointing. Result: stable ~90–98% vs-random across all 120 iterations, no collapse. One boolean was load-bearing.
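The difference that one boolean makes is easy to see in isolation. This is an illustrative sketch, not the body of collect_selfplay: choose_action and its arguments are hypothetical names, and the greedy flag stands in for what opp_greedy controls.

```python
import random


def choose_action(probs, greedy, rng):
    """Return an action index from a probability vector."""
    if greedy:
        # Deterministic: the same probabilities always give the same move.
        return max(range(len(probs)), key=probs.__getitem__)
    # Stochastic: moves are drawn in proportion to their probabilities.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
probs = [0.5, 0.3, 0.2]
greedy_moves = {choose_action(probs, True, rng) for _ in range(1000)}
sampled_moves = {choose_action(probs, False, rng) for _ in range(1000)}
print(greedy_moves, sampled_moves)
```

A thousand greedy calls produce one distinct move; a thousand sampled calls produce all three. Against the first opponent a memorized line always works, and against the second it does not.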

The historical collapse, plotted from docs/caveats.md. Solid segments connect the documented eval points (69 → 85 → 92 → 64 → 42 → 40 → 40 → 53); the dashed segment crosses the documented "bounces 78–91%" region, shown as a band rather than invented points. The same pathology reproduced in self-play against a greedy opponent as 80% → 7% — below untrained.

Key point

PPO didn't malfunction — it over-succeeded against a target that never varied. A deterministic opponent makes the environment deterministic, and a deterministic environment rewards memorizing one exploit line over learning to play. Randomness in the opponent isn't noise to be eliminated; it's what forces the policy to stay a policy.

The KL guard that said everything was fine

Here's the part worth a permanent place in your engineering brain. The PPO update has a safety metric: KL divergence, a standard measure of how far one probability distribution has moved from another — here, how much the policy's action probabilities shifted in one update. The code targets 0.02: if an update's measured KL exceeds that, the remaining epochs are skipped. A guard against exactly the kind of lurch that destroys policies.
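The guard's control flow, reduced to a sketch. The measured KL values are passed in as a list so the example runs on its own; the function name is hypothetical, and whether the real ppo_update checks after each epoch or after each minibatch is an assumption.

```python
def epochs_completed(measured_kl_by_epoch, target_kl=0.02):
    """Count how many epochs run before the KL guard stops the update."""
    completed = 0
    for kl in measured_kl_by_epoch:
        completed += 1        # this epoch's gradient steps have been applied
        if kl > target_kl:    # policy moved too far: skip what remains
            break
    return completed


print(epochs_completed([0.001, 0.001, 0.001, 0.001]))  # quiet update
print(epochs_completed([0.005, 0.030, 0.010, 0.010]))  # guard trips in epoch 2
```

The first update runs all four epochs; the second stops after two. Note what the guard can and cannot see: it only ever looks at the distance between distributions.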

During the collapse, the measured KL sat around 0.001 — twenty times under the alarm threshold — the entire time. The guard watched the policy destroy itself and reported calm at every step. How? The evals ran the policy greedy: always the argmax, the single highest-probability action. When two actions sit at 30.1% and 29.9%, a microscopic logit shift — nearly invisible to KL, which sums tiny probability changes across the whole distribution — swaps which one is the argmax. Behavior flips completely; probability distance barely moves. Many decisions, many near-ties, compounding over a 20-decision game: the played-out behavior drifted enormously while the guarded quantity stayed asleep.
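The near-tie flip can be checked by hand. The sketch below uses the 30.1% and 29.9% pair from the paragraph above and pads the menu with two made-up 20% actions so the probabilities sum to one; the helper names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def argmax(probs):
    """Index of the highest-probability action, i.e. the greedy choice."""
    return max(range(len(probs)), key=probs.__getitem__)


before = [0.301, 0.299, 0.2, 0.2]   # action 0 is the greedy pick
after = [0.299, 0.301, 0.2, 0.2]    # a tiny shift later, action 1 is
kl = kl_divergence(before, after)
print(f"KL = {kl:.2e}, greedy action {argmax(before)} -> {argmax(after)}")
```

The KL comes out on the order of 1e-5, more than a thousand times under the 0.02 target, yet the greedy action has changed. Multiply that by every near-tie in a 20-decision game and the played behavior can be unrecognizable while the guard reads calm.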

In plain terms

This is the incident where every internal dashboard is green while the service returns garbage. CPU fine, memory fine, p99 fine — because the thing that broke isn't what the gauges measure. KL watched the distribution; the failure lived in the argmax. The lesson generalizes to any metric-watcher: a healthy-looking internal metric is not a healthy system. Instrument behavior — the actual outputs users receive — not just internals. The guard stayed in the code as a backstop, but nobody pretends it's the mechanism anymore.

For the curious

A related trap from the same caveats entry: the run's vs-initial eval (learner against its untrained snapshot) plays greedy-vs-greedy, which makes the games near-deterministic — few genuinely independent outcomes per batch, so the number is noisy and jumpy. The stochastic vs-random eval is the trusted anchor. Determinism corrupting a measurement instead of a policy — same root cause, different victim. Also: trainer.py was deliberately never retrofitted with the fix; it keeps the old lr 3e-4 greedy-era behavior and can still collapse on long runs. Only selfplay_train.py carries the cure.

Learning-rate decay, in one paragraph

The learning rate is the size of each weight nudge. The original run held it at 3e-4 forever; the repaired run starts at 1e-4 and follows a cosine decay down to 1e-5 across the 120 iterations: larger steps early, when the weights are random and almost any consistent signal is an improvement, shrinking smoothly to small steps late, when the policy is good and a big step can only break something delicate. The ML word is annealing, after the metallurgy: cool the metal slowly and the structure settles; quench it and it cracks. Early on you're roughing out the shape; by iteration 100 you're polishing, and you don't polish with a hammer.
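A cosine schedule with those endpoints can be written directly. This is a sketch of the standard half-cosine formula; the function name is hypothetical, and whether the real trainer steps the schedule per iteration or per gradient step is an assumption.

```python
import math


def cosine_lr(iteration, total_iters=120, lr_start=1e-4, lr_end=1e-5):
    """Learning rate at a given iteration on a half-cosine from start to end."""
    # Fraction of the run completed, clamped to [0, 1].
    progress = min(max(iteration / total_iters, 0.0), 1.0)
    # cos goes 1 -> -1 over the run, so the bracket goes 2 -> 0.
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * progress))


for iteration in (0, 30, 60, 90, 120):
    print(iteration, f"{cosine_lr(iteration):.2e}")
```

The curve starts at 1e-4, passes 5.5e-5 at the halfway point, and flattens out at 1e-5. It falls slowly at both ends and fastest in the middle.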

The clip is the whole idea. The gradient regularly proposes jumps far outside the region where the collected data means anything. PPO takes only the portion inside the trust zone, then re-collects. "Proximal" = stay close to the policy your evidence describes.

What training can't fix

End-of-chapter honesty. This loop is magnificent at one thing: making the number you gave it go up. It was given "win the game" (a single ±1 reward per episode) and it will optimize exactly that, with the full creativity of thousands of games, and nothing else. Two consequences are already looming.

First: one ±1 per ~20 decisions is brutally thin evidence. Which of those twenty clicks earned the win — the Fake Out on turn 1 or the lucky Rock Slide on turn 9? That's credit assignment, Chapter 7's problem, and the value head from Chapter 5 is about to earn its keep. Second, and darker: when the number itself stops meaning what you think it means — when win-rate against a saturated benchmark keeps looking fine while the policy quietly forgets how to set Trick Room — no amount of training discipline saves you. The collapse in this chapter announced itself with a cratering metric. Chapter 10 is about the failure that didn't.

PPO algorithm

Nudge action probabilities toward outcomes, clipping each update (ratio within 1±0.2) so the policy stays near the one that gathered the data.

src/selfplay/trainer.py · ppo_update

Frozen league mechanic

Opponents are past snapshots (one added every 6 iters, sampled per batch) — never the live learner. Stable target, no strategy-cycling amnesia.

src/selfplay/selfplay_train.py

Best-by-confirm tool

Only a 200-game confirm that beats the previous best saves a checkpoint. Noisy 60-game spot checks promote nothing; a collapse can't lose a good build.

selfplay_train.py · main

KL divergence pitfall

Distance between probability distributions, used as an update guard. Read ~0.001 (target 0.02) through the whole collapse — argmax behavior flips at near-zero KL.

trainer.py · ppo_update (target_kl)

Check yourself

Why did a deterministic (greedy) rollout opponent cause a worse collapse than training against random?

A deterministic opponent makes the whole environment deterministic: the same position always draws the same response. PPO then optimizes what actually pays — one brittle line of play that exploits that fixed script. Entropy collapses, the policy becomes a memorized sequence, and it fails against any opponent that varies. Self-play against a greedy opponent went 80% → 7%, below the untrained net; switching the opponent to sample its moves was the load-bearing fix.

The KL guard targeted 0.02 and measured ~0.001 throughout the collapse. Why didn't it fire?

KL measures how far the probability distribution moved, but evaluation played greedy — always the argmax. When top actions are nearly tied, a tiny logit shift flips which one is the argmax: behavior changes drastically while the distribution barely moves. A healthy internal metric is not a healthy system; you have to watch behavior, not just internals.

During the historical collapse, why was the old 88% checkpoint still intact afterward?

The save rule required a 200-game confirm at ≥88%. The peak's confirms landed 86–87% — just under the bar — so no new checkpoint was ever written, and the collapse had nothing good to overwrite. That near-miss is why best-by-confirm checkpointing became a permanent fixture of the self-play trainer: never let a noisy spot check (or a later disaster) touch the last known-good build.