Chapter 7 · Part II — The learner

The credit problem

A game is about twenty decisions, and the only feedback is one number at the very end: +1 or −1. This chapter is about how the trainer figures out which of those twenty decisions deserved it — and about the time a carefully designed penalty turned out to do nothing at all.

After this chapter you can explain

Why a sparse end-of-game reward makes learning slow, and what "credit assignment" means.
How potential-based reward shaping (PBRS) pays out partial credit each turn without changing what "winning" means — and why the telescoping trick is the whole point.
What GAE does, why it was worth +36 Elo, and why the real lever was variance, not new information.
Why the −0.3 ally-KO penalty was inert, and how a controlled ablation proved it.

Twenty decisions, one number

Recall the shape of the training signal from Chapter 3. The reward for an entire game — one episode, roughly twenty decisions after the bridge auto-plays the forced ones — is a single ±1 delivered at the final turn. Every turn before that, the reward is exactly zero. Not "small." Zero.

Now put yourself in the trainer's shoes. A game just ended +1. Somewhere in those twenty decisions was the turn-2 Trick Room that flipped the speed order and won the whole endgame. Also somewhere in there was a pointless turn-5 switch that accomplished nothing. Both decisions sit in a trajectory that ended +1, so both get pulled toward "do this more." The signal is real — over millions of games, good moves appear in winning games slightly more often than bad ones do — but it is diluted twenty ways and drowned in luck. This is the credit assignment problem, and it is the central difficulty of reinforcement learning with sparse rewards.

In plain terms

Quarterly revenue came in 4% up. Your team shipped twenty pipeline changes last quarter. Which one did it? You have exactly one aggregate number and no per-change attribution. You could ship another million quarters and let statistics sort it out — that's what raw sparse-reward RL does — or you could instrument intermediate metrics (data freshness, join coverage, dashboard latency) that move the day a change lands. The two fixes in this chapter are exactly those two instruments: a per-turn progress metric, and a smarter way to average over the noise.

Discounting, briefly

One knob you should know exists: the discount factor γ. Rewards are shrunk slightly for every step of distance between a decision and the payoff — a move ten turns before the win gets credit multiplied by γ¹⁰. In this project γ = 0.997 (trainer.py), deliberately close to 1: games are short, and the earliest decision of all — the pick-4 draft at team preview — matters as much as anything that happens later, so its credit must arrive nearly undiscounted. At γ = 0.997, a reward twenty turns away still keeps about 94% of its value. Discounting here is a gentle tiebreaker toward faster wins, not a real horizon.
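The arithmetic above can be checked in a few lines. This is a minimal sketch: the helper name `discounted_returns` and the exact twenty-step episode are illustrative assumptions, not code from trainer.py; only γ = 0.997 comes from the text.

```python
GAMMA = 0.997  # discount factor quoted in the text


def discounted_returns(rewards, gamma=GAMMA):
    """Return-to-go for each step: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A sparse-reward win: nineteen zeros, then +1 on the final decision.
rewards = [0.0] * 19 + [1.0]
returns = discounted_returns(rewards)

print(f"credit at the first decision: {returns[0]:.3f}")
print(f"credit at the last decision:  {returns[-1]:.3f}")
print(f"gamma ** 20 = {GAMMA ** 20:.3f}")
```

Every decision in the episode receives almost the same number, which is the dilution described earlier: the raw return says the game was won, not which turn won it.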

Fix #1 — PBRS: pay for progress, not just for winning

The first fix is potential-based reward shaping (PBRS), implemented in the potential() and assign_returns() functions of src/selfplay/trainer.py. The idea: invent a position score Φ — a single number for "how good is this position for me, right now." This project's Φ is a hand-written sum of four-and-a-bit terms:

Material (mechanic). Your six mons' total HP minus the foe's, divided by 6. This is the dominant term: being ahead on health is most of what "winning position" means.

Speed control (mechanic). 0.3 × the TR/Tailwind-adjusted speed bits. Setting Trick Room while your team is slow flips these bits, and Φ jumps the moment you do it.

Boosts & screens (mechanic). 0.3 × net stat boosts (Swords Dance et al.) plus 0.15 × net screens (Reflect, Light Screen). Setup work becomes visible progress.

Weather, benefit-aware (pitfall). 0.1 × (my live weather-abusers − the foe's), clipped to ±2. The word "benefit-aware" is a scar; the war story below explains it.

All four terms live in src/selfplay/trainer.py · potential().

With Φ in hand, every turn's reward gets a shaping term added: Φ(state after) − Φ(state before), scaled by a weight (SHAPE = 0.5). Do something that improves the position and you get paid that turn. Set Trick Room while slow and the speed bits flip: an immediate credit of about +0.6, landing on the exact decision that earned it, instead of a diluted share of a ±1 fifteen turns later.
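The shaping term described above can be sketched as follows. This is not the project's potential(): the `Position` fields, the choice to clip the weather difference before scaling, and the speed bits moving from −2 to +2 are all assumptions made for illustration. Only the coefficients and SHAPE = 0.5 come from the text, and any extra terms in the real Φ are omitted.

```python
from dataclasses import dataclass

SHAPE = 0.5  # shaping weight quoted in the text


@dataclass(frozen=True)
class Position:
    """Hypothetical summary of a board state; every field name is an assumption."""
    my_hp: float = 6.0            # sum of HP fractions over my six mons
    foe_hp: float = 6.0
    speed_bits: float = 0.0       # TR/Tailwind-adjusted speed advantage
    net_boosts: float = 0.0
    net_screens: float = 0.0
    my_weather_abusers: int = 0
    foe_weather_abusers: int = 0


def clip(x, lo, hi):
    return max(lo, min(hi, x))


def potential(p: Position) -> float:
    """Sketch of the four terms listed in the text."""
    material = (p.my_hp - p.foe_hp) / 6.0
    speed = 0.3 * p.speed_bits
    setup = 0.3 * p.net_boosts + 0.15 * p.net_screens
    # Assumption: the clip applies to the abuser difference, then 0.1 scales it.
    weather = 0.1 * clip(p.my_weather_abusers - p.foe_weather_abusers, -2, 2)
    return material + speed + setup + weather


def shaping_reward(before: Position, after: Position) -> float:
    """Per-turn shaping: weighted difference of the position score."""
    return SHAPE * (potential(after) - potential(before))


# Assumed scenario: setting Trick Room flips the speed bits from -2 to +2.
before = Position(speed_bits=-2.0)
after = Position(speed_bits=2.0)
print(f"shaping credit for the Trick Room turn: {shaping_reward(before, after):+.2f}")
```

Under those assumed speed-bit values the credit works out to the "about +0.6" figure in the text; different bit encodings would give a different number.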

Same destination, denser signal. Top: the raw reward — nineteen zeros and a terminal ±1, spread across every decision. Bottom: PBRS adds a small Φ-difference each turn (blue up, pink down), so the turn-2 Trick Room is paid on turn 2. The terminal ±1 is unchanged; the shaping only redistributes credit along the way.

Here is the subtle point, and it is the entire reason the word "potential-based" matters. Credit is only ever given as a difference of the position score. Everything you gain on entering a good state, you give back the moment you leave it: enter a Trick-Room-up position, collect +0.6; when Trick Room expires four turns later, Φ drops and you pay it back. Sum the shaping terms over any full game and they telescope: the middle terms cancel pairwise, leaving only Φ(end) − Φ(start). Φ(start) is fixed before the first decision, and the standard result assumes the potential of a finished game is pinned to a constant (conventionally zero), so that total does not depend on how you played. So no strategy can farm the shaping. The best strategy under the shaped reward is provably the same as the best strategy under the real ±1 (this is a theorem, due to Ng, Harada and Russell). Φ is a denser progress bar toward the same goal, not a different goal.
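Written out, the cancellation looks like this. The notation is an assumption for illustration: s_0 … s_T are the positions at each decision, and the shaping weight and discount are left out for brevity.

```latex
\sum_{t=0}^{T-1} \bigl[\Phi(s_{t+1}) - \Phi(s_t)\bigr]
  = \bigl[\Phi(s_1) - \Phi(s_0)\bigr]
  + \bigl[\Phi(s_2) - \Phi(s_1)\bigr]
  + \cdots
  + \bigl[\Phi(s_T) - \Phi(s_{T-1})\bigr]
  = \Phi(s_T) - \Phi(s_0)
```

Each intermediate Φ(s_k) appears once with a plus sign and once with a minus sign, so only the two endpoints survive. The general form in the theorem is γΦ(s′) − Φ(s); whether trainer.py includes γ in the shaping term is not stated in this chapter.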

Contrast that with the naive version: "+0.1 every time you KO something." That is not a difference of anything, so it doesn't cancel — a policy can learn to farm KOs against sacrificial fodder, drag games out to collect bonuses, and drift away from actually winning. Potential differences structurally can't be farmed. That's the trade: you get to inject a coaching opinion into training, and the math guarantees the opinion can only change how fast the policy learns, not what it ultimately converges to.
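The contrast can be sketched numerically. Both trajectories below are made up, and the sketch assumes both games start and end at the same position score; the helper names are hypothetical.

```python
def shaping_total(potentials):
    """Sum of per-turn potential differences along one trajectory."""
    return sum(b - a for a, b in zip(potentials, potentials[1:]))


def ko_bonus_total(kos_per_turn, bonus=0.1):
    """Total of a flat per-KO bonus, which is not a difference of anything."""
    return bonus * sum(kos_per_turn)


# Position score at each decision point (illustrative numbers only).
direct = [0.0, 0.4, 0.7, 1.0]                  # wins in three turns
stalling = [0.0, 0.4, 0.1, 0.5, 0.2, 0.6, 1.0]  # drags the game out

# KOs scored on each turn (illustrative numbers only).
direct_kos = [1, 1, 1]
stalling_kos = [1, 1, 1, 1, 1, 1]

print("potential shaping:", round(shaping_total(direct), 6),
      "vs", round(shaping_total(stalling), 6))
print("flat KO bonus:    ", round(ko_bonus_total(direct_kos), 6),
      "vs", round(ko_bonus_total(stalling_kos), 6))
```

The stalling line collects the same shaping total as the direct one but twice the flat bonus, which is the incentive to drag games out that the paragraph above warns about.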

Key point

PBRS turns one end-of-quarter number into a per-turn progress metric — and because the credit is a difference of a position score, it telescopes: it can accelerate learning but provably cannot change what the optimal strategy is.

Measured effect: the shaped-vs-unshaped ablation rated 83 vs 64 Elo — about +19 Elo for shaping, per docs/roadmap.md. Useful, real, not enormous. The bigger prize was still to come.

War story

"Provably can't change the goal" does not mean "can't mis-coach along the way." The first weather term in Φ credited simply having your own weather up. Sounds harmless — until you watch a rain team without a live rain abuser being piloted into spending turns maintaining rain that benefits nobody, because upkeep itself was being paid. It caused a measurable regression on rain teams. The fix (checkpoint policy_mc5) made the term benefit-aware: weather is credited by how many of your live mons actually exploit it (boosted STAB type or a payoff ability) minus the foe's — the weatherBenefit observation. The benefit-aware version beat the baseline by +3.6 percentage points across the metagame (full saga: archive/benchmarking.md, notes in docs/tech-debt.md). The lesson generalizes: every shaping term is a coaching opinion about what "progress" is, and a wrong opinion coaches wrong play for as long as training lasts. PBRS guarantees the destination, not the detours.

Fix #2 — GAE: blending reality with the critic

The second fix attacks a different enemy: noise. Recall from Chapter 5 that the network has a value head — a side output that estimates, from any position, how likely we are to win from here. In RL jargon that estimator is called the critic. The learning update in PPO doesn't actually push on raw returns; it pushes on the advantage of each move: how much better did things go after this decision than the value head expected? Chose a move from a 50/50 position and ended up clearly winning — positive advantage, do it more. Ended up losing — negative, do it less.

But "how did things actually go" can be measured two ways, and both are flawed:

Actual outcome only. Follow the real trajectory to the end and use what really happened. Honest — but drenched in noise: damage rolls, critical hits, the opponent's own luck. A perfect move followed by three bad rolls reads as a mistake.
Critic only. Take one real step, then trust the value head's estimate of everything after. Smooth and low-noise — but only as good as the critic, and the critic is a small network trained on its own imperfect play. Its errors become your errors, systematically.

GAE — generalized advantage estimation — refuses to pick a side. It blends the two along the trajectory: trust short-horizon reality (the next few actual steps), plus the critic's estimate of the rest, with a knob λ controlling the blend. λ = 0 is pure critic; λ = 1 is pure actual outcome; this project runs λ = 0.95 (assign_returns() in src/selfplay/trainer.py) — mostly reality, smoothed by the critic. The critic also bootstraps credit backward through the game, which is how the very first decision — the pick-4 — receives usable credit without the terminal reward having to propagate raw through every intervening turn.
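The blend can be sketched with the standard backward recursion for GAE. This is the textbook form, not a copy of assign_returns(); the function name, the argument layout, and the sample rewards and critic values are assumptions.

```python
def gae_advantages(rewards, values, gamma=0.997, lam=0.95):
    """Generalized advantage estimation over one finished episode.

    rewards[t] is the reward after decision t (t = 0..T-1).
    values[t] is the critic's estimate at decision t, with one extra entry
    values[T] for the state after the last step (0.0 once the game is over).
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        # One-step surprise: what happened versus what the critic expected.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Blend in later surprises, decayed by gamma * lam per step.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Illustrative four-decision win with made-up critic estimates.
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.4, 0.8, 0.0]

for lam in (0.0, 0.95, 1.0):
    adv = gae_advantages(rewards, values, lam=lam)
    print(f"lambda={lam:.2f}:", [round(a, 3) for a in adv])
```

With lam=0.0 each advantage is just the one-step surprise (pure critic); with lam=1.0 it is the discounted actual outcome minus the critic's estimate (pure outcome). Values in between interpolate.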

The λ knob blends two flawed estimators. An illustrative trajectory, credited three ways. Pure actual-outcome credit (top) is unbiased but jagged with luck; pure critic credit (middle) is smooth but inherits the value head's errors; the λ = 0.95 blend (bottom) keeps the trend of reality with far less of the noise. Same underlying game every time — only the accounting differs.

The result in this project: GAE was the single largest strength gain of the entire credit-assignment thread. In a controlled head-to-head, the GAE-trained policy rated Elo 80 vs 44 for the ablated (no-GAE, unshaped) run — +36 Elo (docs/roadmap.md, docs/architecture.md). And notice what the lever actually was: GAE added no new information. Same games, same rewards, same features. It reduced the variance of each update — less luck contaminating each gradient step, so every batch of games taught more. In data terms: you didn't collect new data, you fixed the aggregation so the same data stopped lying to you per-batch.
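A toy calculation shows why λ below 1 cuts noise. It assumes something much cleaner than real training: a well-calibrated critic whose one-step surprises are independent, zero-mean, and equally noisy on every turn. Under that assumption the variance of the first decision's advantage is proportional to the sum of (γλ)^(2l) over the remaining steps. The function name is hypothetical, and the result says nothing about the measured Elo figures.

```python
def noise_factor(gamma, lam, horizon):
    """Variance multiplier for the first decision's advantage.

    Assumes each one-step surprise is independent, zero-mean, unit-variance,
    so weights (gamma * lam) ** l contribute their squares to the variance.
    """
    k = (gamma * lam) ** 2
    return sum(k ** l for l in range(horizon))


pure_outcome = noise_factor(0.997, 1.0, 20)   # lambda = 1: follow reality to the end
blended = noise_factor(0.997, 0.95, 20)       # lambda = 0.95: the project's setting

print(f"lambda=1.00 noise factor: {pure_outcome:.2f}")
print(f"lambda=0.95 noise factor: {blended:.2f}")
print(f"ratio: {blended / pure_outcome:.2f}")
```

In this idealized setting the blend carries a bit under half the variance of the pure-outcome estimate over twenty steps. The price, not modeled here, is bias from whatever errors the critic has.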

Key point

The biggest single strength gain in the project's credit thread came not from new signal but from less noise per update. Variance reduction is a first-class lever, not a footnote.

The instructive failure: the ally-KO penalty

One more story, because failed levers teach as much as successful ones. Problem: the policy would sometimes Earthquake — a spread move that hits everyone adjacent, partner included — while its own healthy ally stood next to the epicenter. Goal: stop KOing your own partner. The obvious RL-shaped fix: a −0.3 reward penalty on any transition where a chosen move KOs a healthy, exposed ally (ALLY_KO_PEN, wired through assign_returns).

It was tested properly: a controlled ablation — two from-scratch training runs, identical in every respect (same seed, same recipe) except that one had the penalty and one didn't. Both runs converged to roughly 18% ally-KO propensity. The penalty did nothing. The real improvement — down from the earlier default's 25% — had already come from something else entirely: fixing a bug in the allyHit feature, which had been telling the policy that single-target moves like Knock Off could somehow hit the partner (details in docs/regression.md). Perception fixed it; punishment didn't. The penalty code was reverted (docs/roadmap.md §7).

Why was it inert? Ally-KOs are rare. A −0.3 penalty on an event that appears in a small fraction of games produces a gradient that is negligible against the pressure of millions of ordinary decisions — the signal is simply too small a fraction of the total loss to steer anything. Two lessons worth keeping:

Rare events need a big lever or a different mechanism. The options are a much larger coefficient (risking distortion of legitimate Earthquake use), upweighting the rare transitions, or stepping outside RL entirely by masking the action or adding a decision-time rule, the way the fully-immune-attack blunder was fixed. Until one works, ally-KO stays report-only in the regression gate.
You only know the lever did nothing because the ablation was controlled. Change one thing, keep the seed, compare. Without the paired run, the drop from 25% to 18% would have been credited to the penalty, the belief would have shipped, and the real cause — the feature fix — would have gone unrecorded. It's the same discipline as an A/B test with a held-out control: no control, no attribution.
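The "change one thing, keep the seed" discipline can be enforced mechanically. The sketch below is hypothetical: `RunConfig`, its field names, and the seed value are invented for illustration and are not the project's config format.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical training-run config; every field name is an assumption."""
    seed: int = 0
    gamma: float = 0.997
    gae_lambda: float = 0.95
    shape_weight: float = 0.5
    ally_ko_penalty: float = 0.0


def differing_fields(a: RunConfig, b: RunConfig):
    """Names of the fields whose values differ between two runs."""
    da, db = asdict(a), asdict(b)
    return sorted(name for name in da if da[name] != db[name])


def assert_controlled(a: RunConfig, b: RunConfig, allowed: str):
    """Refuse to compare runs that differ in anything but the one flag."""
    diff = differing_fields(a, b)
    if diff != [allowed]:
        raise ValueError(f"not a controlled ablation: runs differ in {diff}")


control = RunConfig(seed=1234)  # seed value is made up
treatment = replace(control, ally_ko_penalty=-0.3)

assert_controlled(control, treatment, "ally_ko_penalty")
print("runs differ only in:", differing_fields(control, treatment))
```

Building the treatment config from the control with `replace` makes the shared seed and recipe the default, so any accidental second difference has to be introduced deliberately and is caught by the check.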

For the curious

Notice the asymmetry with PBRS. Trick Room setup is also "rare-ish," yet the Φ speed-control term coaches it fine — because Φ pays on the common, every-turn currency of position (speed bits are always part of the state), not on a rare event. The ally-KO penalty fired only on the rare event itself. Where you attach the signal matters as much as its size.

Check yourself

Why can't a policy "farm" PBRS shaping the way it could farm a +0.1-per-KO bonus?

Because PBRS credit is always a difference of a position score: whatever Φ you gain entering a state, you give back when you leave it. Over a whole game the terms telescope to Φ(end) − Φ(start), which no in-game strategy can inflate — so the optimal strategy under the shaped reward is provably identical to the optimal strategy under the real ±1. A flat per-KO bonus is not a difference of anything, so it accumulates and can be gamed.

GAE added no new information to training. Where did the +36 Elo come from?

Variance reduction. The advantage of each move was previously estimated from the full actual outcome, which is saturated with damage-roll and opponent luck. GAE(λ=0.95) blends short-horizon reality with the value head's estimate of the rest, so each gradient update carries the same signal with much less noise — and the same volume of games teaches more per batch.

The ally-KO ablation showed ~18% ally-KO with the penalty and ~18% without. What conclusion was justified, and what made it trustworthy?

The justified conclusion: the −0.3 penalty was inert, and the earlier improvement from 25% to 18% came from the allyHit feature fix, not the penalty. It was trustworthy because the comparison was a controlled ablation — two from-scratch runs identical except for the one flag, same seed — so the difference (or absence of one) was attributable to the penalty alone.