Chapter 8 · Part III — The craft

Measuring strength

Training produces checkpoints; measurement decides which one ships. In this project measurement is a real subsystem — src/evaluate/ — and most of the worst surprises in the whole build were measurement surprises: a benchmark that saturated, a rating that flipped rank between runs, and an evaluation mode that hid a 22-point improvement. This chapter is how the project learned to trust a number.

After this chapter you can explain

The two independent axes — correctness and strength — and why a checkpoint must clear both before promotion.
Why "% vs random" stopped meaning anything, and what Elo can and cannot resolve here.
How the paired pilot benchmark resolves ~1-point edges that Elo drowns in team-matchup luck.
Why every eval tool samples instead of taking the argmax — and the 22.4-point gain that greedy evaluation hid.

Two axes, and you need both

docs/evaluate.md opens with the frame that everything else hangs on. Evaluating a checkpoint has two independent axes, and neither substitutes for the other:

Correctness — does it still play the strategies and avoid clicks that are wrong under any theory of the game? Does it set Trick Room on a Trick Room team, mega-evolve, never attack into a fully type-immune foe? This is the job of the regression gate (src/evaluate/policy_regression.py), a deterministic pass/fail battery. Its full story — the disaster that forced its existence — is Chapter 10's; for now, know that it exists and that no checkpoint promotes without passing it.
Strength — does it actually win more than the incumbent it would replace?

The reason both axes are needed is that a checkpoint can pass one and fail the other, and the project has concrete examples of each. A raw RL base can be perfectly correct and weak. And mc7 — once the promoted default — hit 78% win-rate against its benchmark field while setting Trick Room in 0% of games on teams whose entire plan is Trick Room: strong-but-narrow, a gate failure hiding behind a great scalar. Promotion requires clearing both axes, full stop.

In plain terms

It's the difference between data-quality checks and business metrics. The gate is your dbt tests — schema intact, no nulls where nulls mean corruption — and a pipeline that doubles throughput while silently dropping a column must not ship, whatever the throughput dashboard says. Strength is the KPI. You'd never replace one with the other, and you'd never deploy on either alone.

Why vs-random lies

Early in the project, "win-rate vs random" (the policy against Showdown's built-in random-move player) was the number. Every capability step in Part II was first felt there. Then it stopped working, for the most ordinary of reasons: saturation. Once every serious checkpoint beats random at least ~90% of the time, the metric's ceiling is doing the discriminating, not the policies.

The killer example, straight from docs/evaluate.md: two ablation checkpoints both read 92% vs-random — indistinguishable — yet when they played each other, they rated Elo 94 vs 37. A real, decisive strength gap that the vs-random number literally could not see. It's still printed every training iteration, but as a cheap floor-check — "the run hasn't broken" — nothing more. Never promote on it.

Key point

A smoke test is not a benchmark. vs-random tells you the pipeline still runs; it cannot rank two healthy builds, because both have already maxed out what the test can measure.

Elo: an absolute ladder, with coarse resolution

The replacement for vs-random is an Elo ladder (src/evaluate/elo.py). Elo is the rating system from chess: each player carries a number, and the gap between two numbers predicts the win probability between them — equal ratings mean 50/50, a +100 gap means roughly 64% for the stronger side, and the ratings are updated after each game until they stabilize. Two properties make it right for this project. First, it's relative by construction — you learn A vs B by playing them, no external truth needed. Second, with one anchor it becomes comparable across time: here the random player is pinned at 0, so "Elo 309" means the same thing this month as last month, across generations of checkpoints that never coexisted.
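The rating-to-probability rule above can be sketched in a few lines. This is the textbook logistic Elo model, not a copy of src/evaluate/elo.py: the 400-point scale and K = 16 are the conventional chess values and are assumptions here, as are the function names.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 16.0):
    """Return both ratings after one game (score_a: 1.0 = A won, 0.0 = A lost).

    An anchored player (random, pinned at 0) would simply discard its half
    of the update; that bookkeeping is left out of this sketch.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    print(round(expected_score(0, 0), 2))     # equal ratings: 0.5
    print(round(expected_score(100, 0), 2))   # +100 gap: roughly 0.64
    print(round(expected_score(94, 37), 2))   # the 94-vs-37 ablation pair
```

Under this model the 94-vs-37 pair from the vs-random section corresponds to an expected head-to-head win-rate of about 58% for the stronger checkpoint, a gap that two identical 92% vs-random readings cannot show.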

The ladder, with the two axes visible at once. Real ratings from docs/roadmap.md: random anchored at 0, mc9 ≈220, policy_wide 244, mc7 301, policy_wider 309. mc7 outrates almost everything, yet it is dashed out because it fails the regression gate (it never sets Trick Room). policy_wider is the first checkpoint to clear mc7's strength and pass the gate, which is why it is the deployed default.

Now the limitation, and it's a big one. In this project a game's outcome is dominated by team-matchup luck: both sides draw random teams from the 599-team corpus, and whether your draw counters their draw often matters more than who pilots better. The pilot's edge, the thing you're trying to measure, is a 1–3 pp ripple on top of that swell. Per docs/evaluate.md, ~300 games per pairing resolve only about a 6% difference in win-rate; the project once watched mc2 and mc3 flip rank between two Elo runs. Practical rules: use --games 96 or more before acting on a number, and when the edge you care about is small (it usually is), reach for a sharper tool.
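One plausible reading of the ~6% figure is plain binomial arithmetic. The sketch below assumes independent games and a win-rate near 50%, and treats "resolvable" as roughly two standard errors; docs/evaluate.md may derive its number differently, and win_rate_se is a name made up for this example.

```python
import math


def win_rate_se(p: float, games: int) -> float:
    """Binomial standard error of a win-rate observed over independent games."""
    return math.sqrt(p * (1.0 - p) / games)


for games in (96, 300, 3000):
    se = win_rate_se(0.5, games)
    # Two standard errors is a common rule of thumb for "distinguishable".
    print(f"{games:>5} games: SE = {100 * se:.1f} pp, 2*SE = {200 * se:.1f} pp")
```

At 300 games two standard errors come to a little under 6 pp, which lines up with the figure quoted above, and it takes on the order of thousands of games before the same band shrinks toward the 1–3 pp edges that matter. On this reading, 96 games is a floor for acting on a number, not a target.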

pilot_bench: the paired experiment

The sharp tool is pilot_bench (src/evaluate/pilot_bench.py), and it is the measurement centerpiece of the project. The move is one every data engineer knows: if noise comes from a nuisance variable, don't average over it; hold it fixed. pilot_bench freezes everything else: the featured team, the opponent field, the seeds, the opponent's policy. Only one thing varies between the two arms: which policy pilots player one. Any difference between the arms is pilot skill, because nothing else is free to differ. That pairing collapses the standard error to about 1 percentage point, a resolution Elo needs thousands of games to approach.

In plain terms

It's a matched-pairs A/B test. Instead of comparing this quarter's revenue (new pipeline, new customers, new season) against last quarter's, two noisy absolute numbers whose difference is mostly nuisance, you replay the same traffic through both versions and diff the outputs row by row. A paired difference on identical inputs: the nuisance variance subtracts itself out.

This is the tool that resolved the project's decisive late-stage calls: policy_wide at 74.0% vs the default's 70.4% (+3.6 pp, SE 0.5) — a gap Elo saw only murkily as 260 vs 253 — and policy_wider at 84.5% vs the mc2 field, +8.3 pp over wide at SE 0.5.

Why pairing shrinks the error bars. Left: an unpaired comparison sums pilot skill with team-draw luck, so the clouds overlap and only a ~6% gap survives 300 games. Right: pilot_bench replays the same matchup rows with only the pilot swapped; each row is its own controlled experiment, and the difference down the column isolates piloting skill at ~1 pp standard error.
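The effect of pairing can be reproduced with a toy simulation. Everything in it is an assumption made for illustration: matchup luck is modeled as a per-row base win probability, the pilot edge as a flat +2 pp, and "same seeds" as one shared luck draw per row. Real battles are only partly determined by the seed, so the real gain from pairing is smaller than this idealized version suggests.

```python
import random
import statistics


def simulate(n_rows: int = 2000, edge: float = 0.02, seed: int = 0):
    """Return (paired SE, unpaired SE) of the estimated pilot edge."""
    rng = random.Random(seed)
    paired_diffs = []
    unpaired_a, unpaired_b = [], []
    for _ in range(n_rows):
        # Paired arm: one matchup and one luck draw, shared by both pilots.
        matchup = rng.uniform(0.1, 0.9)
        luck = rng.random()
        win_a = luck < matchup
        win_b = luck < matchup + edge
        paired_diffs.append(int(win_b) - int(win_a))
        # Unpaired arm: each pilot gets its own matchup and its own luck.
        unpaired_a.append(int(rng.random() < rng.uniform(0.1, 0.9)))
        unpaired_b.append(int(rng.random() < rng.uniform(0.1, 0.9) + edge))
    se_paired = statistics.stdev(paired_diffs) / n_rows ** 0.5
    se_unpaired = (statistics.variance(unpaired_a) / n_rows
                   + statistics.variance(unpaired_b) / n_rows) ** 0.5
    return se_paired, se_unpaired


se_paired, se_unpaired = simulate()
print(f"paired SE:   {100 * se_paired:.2f} pp")
print(f"unpaired SE: {100 * se_unpaired:.2f} pp")
```

With the matchup shared, the only rows that contribute variance are the ones where the pilot swap actually changes the result, which is why the paired error bar comes out several times tighter at the same number of rows.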

Two refinements complete the picture. First, the 2×2 de-confound: a win-rate shift when you swap policies could mean "the new policy pilots this team better" or "the new policy makes a worse opponent field." pilot_bench lets you vary the pilot and the field independently — run each candidate as pilot against each candidate as field — and the 2×2 grid splits the effect cleanly. Second, know which question a tool answers: src/evaluate/benchmark_team.py drives both sides with the same policy, so it measures how strong a team is in the meta, not how good a pilot is. Swap policies inside it and you've mixed the two questions again.
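The 2×2 de-confound is just arithmetic on four cells. The win-rates below are invented for illustration, and the function names are not taken from pilot_bench; the point is only how the grid separates the two explanations.

```python
# Hypothetical win-rates (percent) for the featured team: grid[pilot][field].
grid = {
    "old": {"old": 70.0, "new": 73.0},
    "new": {"old": 74.0, "new": 77.0},
}


def pilot_effect(g):
    """Average gain from swapping the pilot, holding the field fixed."""
    return sum(g["new"][f] - g["old"][f] for f in ("old", "new")) / 2


def field_effect(g):
    """Average gain from swapping the field, holding the pilot fixed.

    Positive means the new policy is an easier opponent to beat.
    """
    return sum(g[p]["new"] - g[p]["old"] for p in ("old", "new")) / 2


naive_shift = grid["new"]["new"] - grid["old"]["old"]
print(naive_shift, pilot_effect(grid), field_effect(grid))
```

In this made-up grid, swapping the policy on both sides at once shows a 7 pp shift, but only 4 pp of it is better piloting; the other 3 pp comes from the new policy being a softer field.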

archetype_bench: win-rate, per strategy

One aggregate win-rate can hide a hole against a specific strategy. src/evaluate/archetype_bench.py breaks the number into per-archetype columns — sun, rain, Trick Room, tailwind hyper-offense — along two axes: --mode defend (the policy plays against each archetype) and --mode attack (the policy pilots each archetype). One clever detail makes the columns meaningful: a committer wraps the archetype side and forces its declared win-condition move when legal, deferring everything else to the policy. Without it, a "vs Trick Room" column would partly measure the field forgetting to set Trick Room at all; with it, the column measures the policy's counterplay against the strategy actually being executed.
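The committer idea fits in a small wrapper. This is a sketch under assumed interfaces: a policy is taken to be any callable from (state, legal_actions) to an action name, and make_committer is a hypothetical helper, not the class in src/evaluate/archetype_bench.py.

```python
from typing import Callable, Sequence

# Assumed policy interface: (state, legal_actions) -> chosen action name.
Policy = Callable[[object, Sequence[str]], str]


def make_committer(base_policy: Policy, win_condition_move: str) -> Policy:
    """Wrap a policy so it plays the archetype's declared move whenever legal."""

    def committed(state, legal_actions):
        # Force the win-condition move when the game allows it. A real
        # committer would presumably also check that the condition is not
        # already active; this sketch ignores that.
        if win_condition_move in legal_actions:
            return win_condition_move
        # Otherwise defer everything to the underlying policy.
        return base_policy(state, legal_actions)

    return committed


def first_legal(state, legal_actions):
    """Stand-in policy for the demo: always picks the first legal action."""
    return legal_actions[0]


trick_room_side = make_committer(first_legal, "trickroom")
print(trick_room_side(None, ["protect", "trickroom"]))
print(trick_room_side(None, ["protect", "earthquake"]))
```

Keeping the wrapper this thin is what makes the column interpretable: the archetype side still plays with the policy's own judgment everywhere except the one click that defines the strategy.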

This tool delivered one of the project's most clarifying findings: when the deployed strategy-preserving policy was compared to raw mc7, the gap was uniform — about +12 pp in mc7's favor across every archetype, random included. Not a Trick Room blind spot, not a rain problem: a flat, general strength gap. That single table redirected the whole roadmap from "coach specific matchups" to "raise raw strength." The full story of what that meant belongs to Chapter 10.

The gotcha that hid 22 points: sample, don't argmax

War story

For a while, evaluation played the policy "greedy": at each decision, take the argmax (the single highest-probability action) instead of sampling from the full distribution. Deterministic, reproducible, seemingly harmless. Then mc6 arrived, the first expert-iteration (ExIt) checkpoint, and under greedy evaluation its gain over mc5 measured a shrug: +1.1 pp. Under sampled play, the same pair measured mc5 45.6% vs mc6 68.0%, a gap of +22.4 pp. Greedy evaluation had hidden an entire checkpoint's improvement. Why: the distillation had moved probability mass, not the argmax, and setup lines like Trick Room live in the policy stochastically, at meaningful-but-not-top probability, so a greedy readout simply never plays them. Every eval tool now samples by default, and docs/evaluate.md lists "sample, don't --greedy" as gotcha number one. (What ExIt is and why mc6 exists are Chapter 9's story; the measurement lesson stands on its own.)

eval mode	mc5	mc6	gap
greedy (argmax)	50.1%	51.2%	+1.1 pp
sampled	45.6%	68.0%	+22.4 pp

Note the columns, too: greedy even flattered mc5 (50.1% vs its sampled 45.6%) while starving mc6 (51.2% vs its sampled 68.0%). The evaluation mode wasn't adding uniform noise; it was systematically distorting the comparison.
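The mechanism is easy to see on a single decision. The two distributions below are invented stand-ins for a before and after checkpoint, not real mc5 or mc6 outputs: probability mass moves toward Trick Room, but the top action does not change.

```python
import random

# Hypothetical action distributions at one decision point.
before = {"attack": 0.60, "trick_room": 0.10, "protect": 0.30}
after = {"attack": 0.55, "trick_room": 0.35, "protect": 0.10}


def greedy(dist):
    """Argmax readout: always the single most likely action."""
    return max(dist, key=dist.get)


def sample(dist, rng):
    """Sampled readout: draw an action in proportion to its probability."""
    actions = list(dist)
    return rng.choices(actions, weights=[dist[a] for a in actions], k=1)[0]


rng = random.Random(0)
n = 10_000
greedy_rate = sum(greedy(after) == "trick_room" for _ in range(n)) / n
sampled_rate = sum(sample(after, rng) == "trick_room" for _ in range(n)) / n

print(greedy(before), greedy(after))   # same top action both times
print(greedy_rate, round(sampled_rate, 2))
```

Greedy play picks the same action from both distributions and never sets Trick Room, while sampled play sets it about a third of the time from the second one. Any improvement that lives in that shifted mass is invisible to an argmax readout.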

Proxies lie, and one run is one sample

Two closing disciplines from docs/evaluate.md, both earned the hard way.

Rank on win-rate, never on a proxy. The project keeps behavioral probes — weather uptime, mega-evolution rate, setup-move propensity — and they are genuinely useful for explaining why a number moved. But they are never the number. Every single time a proxy disagreed with win-rate in this project, win-rate was right; the canonical case was a policy that kept its weather up longer — the proxy said better — and lost more games. Use probes to form the hypothesis, then confirm with a paired win-rate run.

One run per condition is a caveat, not a conclusion. Same-seed training runs make two conditions comparable — that's the ablation discipline from Chapter 7 — but a single sample still cannot cleanly separate a 1–3 pp aggregate effect from training-path noise. policy_wide's +3.6 pp rested on one seeded run, and the roadmap explicitly flags replicating on 2–3 seeds before building further on such a result. Declare winners after replication, not before.

For the curious

There's a hierarchy hiding in this chapter, and it's worth making explicit: vs-random (smoke test) → Elo (coarse absolute ladder) → pilot_bench (sharp paired comparison) → archetype_bench (diagnostic breakdown) → your own eyes on a game log. Each level trades generality for resolution. The craft is knowing which level your question lives at — "is this run broken?" is level one; "is this +2 pp real?" is level three; "why?" is level four.

The promotion checklist

Everything above compresses into the ritual from docs/evaluate.md that every would-be default must survive, in order:

The gate — mandatory, blocks promotion. policy_regression.py --ckpt <candidate>.pt must pass every blocking assertion. Deterministic and seeded, so a pass is real, not luck. No exceptions — this is the one hard requirement.
The Elo ladder — is it stronger, absolutely? Rate it against the current default, a couple of recent generations, and the random anchor, with --games 96 or more.
The pilot bench — is it a better pilot? If Elo is close (it usually is), resolve the call with the paired benchmark; use the 2×2 if you need to split pilot-side from field-side effects.
Spot-check with your eyes. Pilot the corpus's heaviest Trick Room team and watch it actually set Trick Room; run src/benchmarks/showdown.py for a readable game recap. Numbers first, but never numbers only — the entire Chapter 10 disaster would have been visible in one honestly-read game log.

Key point

Promote only what passes the gate and is at least as good as the incumbent on both Elo and pilot_bench. One scalar is never enough: the project's worst regression shipped through two promotions on a rising win-rate.
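The rule reads naturally as a predicate. The sketch below is illustrative only: EvalResult and should_promote are hypothetical names, the numbers are invented, and a real decision would also weigh the standard errors and the eyes-on spot-check, which no boolean captures.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    gate_passed: bool       # regression gate: every blocking assertion passed
    elo: float              # rating on the ladder, random anchored at 0
    pilot_win_rate: float   # percent, from the paired benchmark


def should_promote(candidate: EvalResult, incumbent: EvalResult) -> bool:
    """Gate first, then require no regression on either strength measure."""
    if not candidate.gate_passed:
        return False
    return (candidate.elo >= incumbent.elo
            and candidate.pilot_win_rate >= incumbent.pilot_win_rate)


# Illustrative values only.
incumbent = EvalResult(gate_passed=True, elo=240.0, pilot_win_rate=76.0)
strong_but_narrow = EvalResult(gate_passed=False, elo=300.0, pilot_win_rate=80.0)
stronger_and_correct = EvalResult(gate_passed=True, elo=310.0, pilot_win_rate=84.0)

print(should_promote(strong_but_narrow, incumbent))     # gate fails
print(should_promote(stronger_and_correct, incumbent))  # clears both axes
```

Checking the gate before anything else mirrors the order of the checklist: a high scalar never gets the chance to argue for a checkpoint that has already failed on correctness.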

Check yourself

Two checkpoints both score 92% vs random. What do you actually know about them?

Almost nothing comparative — only that neither run is broken. vs-random saturates once every serious checkpoint beats it ~90%+, so it can't rank healthy policies: that exact pair rated Elo 94 vs 37 head-to-head, a decisive gap the smoke test was structurally unable to see. Rank with Elo or pilot_bench instead.

Elo and pilot_bench both measure strength. Why does the project need both?

They answer at different resolutions. Elo gives an absolute ladder comparable across generations (random anchored at 0), but game outcomes are dominated by which random teams each side draws from the 599-team corpus, so ~300 games resolve only ~6% — too coarse for the 1–3 pp pilot edges that matter. pilot_bench holds team, field, seeds, and opponent policy fixed and varies only the pilot, pairing the nuisance variance out to ~1 pp SE. Elo answers "is it stronger overall?"; pilot_bench answers "is it a better pilot, by how much?"

Why do all evaluation tools sample from the policy instead of playing its highest-probability action?

Because important behavior lives in probability mass below the argmax. Setup lines like Trick Room are encoded stochastically, and improvements can move mass without moving the top action — greedy evaluation measured mc6's gain over mc5 as +1.1 pp when sampled play showed +22.4 pp (45.6% vs 68.0%). Greedy argmax doesn't just add noise; it systematically hides exactly the behavior you most need to see.