Chapter 1 · Part I — The setup
The game and the goal
Before any code makes sense, you need to feel why this particular game is a nightmare for a computer — and what, exactly, this project promised to deliver. By the end of this chapter you'll be able to defend both.
- The Regulation M-B ruleset: doubles, level 50, bring-6 / pick-4, Open Team Sheets, Megas on, Tera off.
- The anatomy of one game, from team preview to the last faint — and why one game is only ~20 decisions.
- The four reasons this game is harder for a computer than chess, and the one mercy the rules grant.
- Why the product is a recommender, not just a bot — and why "random" became the baseline to beat.
The rules in ninety seconds
Pokémon Champions Regulation M-B is a competitive doubles format — "VGC" style, for Video Game Championships. Two players. Each brings a roster of six Pokémon at level 50, but each game they play with only four of them. Two of those four start on the field (the leads); the other two wait on the bench. So every game opens with a draft against yourself: which four of my six, and which two go out first?
Both full six-mon teams are visible to both players before the game starts. This rule is called Open Team Sheets, and it matters enormously — hold that thought for a few paragraphs.
Two more rule switches define the format. Mega Evolution is ON: once per game, one Pokémon holding its Mega Stone can transform mid-battle into a stronger form. Terastallization is OFF: the type-changing mechanic from recent games simply doesn't exist here (the Showdown mod hardcodes it away).
And the rule that changes everything: both players move at the same time. Each turn, you secretly choose an action for each of your two active Pokémon; your opponent does the same; then all four choices resolve together. There is no "I move, then you respond." Every turn is a sealed-envelope exchange.
| Rule | What it means in practice |
|---|---|
| Doubles, level 50 | Two active Pokémon per side, always. Positioning and targeting matter. |
| Bring 6, pick 4 | Every game starts with a 90-way draft decision (leads + back pair). |
| Open Team Sheets | Both full teams visible at preview. No hidden-roster guessing. |
| Megas ON (one per team) | One transformation per game — a timing decision the bot must learn. |
| Tera OFF | One whole mechanic removed. Fewer branches, mercifully. |
| Simultaneous turns | No turn-taking. Both sides commit blind, then everything resolves at once. |
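The 90 in the table can be checked by brute force. The sketch below enumerates every way to split six mons into an unordered lead pair and an unordered bench pair; the roster names are placeholders, and `enumerate_drafts` is a hypothetical helper, not a function from the project.

```python
from itertools import combinations

def enumerate_drafts(roster):
    """Return every (leads, bench) split of a six-mon roster.

    Leads and bench are both unordered pairs; the two unpicked mons stay home.
    """
    drafts = []
    for leads in combinations(roster, 2):
        remaining = [mon for mon in roster if mon not in leads]
        for bench in combinations(remaining, 2):
            drafts.append((leads, bench))
    return drafts

roster = ["A", "B", "C", "D", "E", "F"]  # placeholder names, not real team data
drafts = enumerate_drafts(roster)
print(len(drafts))  # 15 lead pairs x 6 bench pairs = 90
```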
Anatomy of one game
A game has three acts. First, team preview: you see both six-mon teams, you pick your four, and the order matters because the first two are your leads. Second, the turns: each turn, each side picks one action per active slot (a move with a target, a switch to a benched mon, possibly a Mega Evolution tacked on), and everything resolves simultaneously. Third, the end: the first player whose last Pokémon faints loses. In this project's environment that final result is the reward, the single number the learner is ultimately scored on: +1 for a win, −1 for a loss. One full game is one episode, the unit of experience the learner collects.
Now put numbers on it. The team preview alone is a 90-way decision: 15 ways to choose an unordered pair of leads from your six, times 6 ways to choose an unordered bench pair from the remaining four (15 × 6 = 90). Then each battle turn, your action is a joint action: one choice per active slot, chosen together. Each slot can pick any of its moves (with a choice of target: which foe, or sometimes your own partner), tack a Mega Evolution onto a move if eligible, or switch to a benched mon. Multiply the two slots' options and you get the turn's action space, the menu of everything you could legally do, which in this environment runs up to roughly 400 legal joint actions after de-duplication and a hard cap. And yet a whole game is only about 20 decisions long. Very wide, very short.
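To see how two modest per-slot menus multiply into hundreds of joint actions, here is a rough sketch. Everything in it is an assumption made for illustration: the option encoding, the two legality filters, the idea that every move offers three targets, and the cap of 400 (the text only says "roughly 400"). The project's actual de-duplication rules are not shown in this chapter and may differ.

```python
from itertools import product

def slot_options(moves, targets, can_mega, bench):
    """List one active slot's options as (kind, name, target, mega) tuples."""
    options = []
    for move in moves:
        for target in targets:
            options.append(("move", move, target, False))
            if can_mega:
                # Mega Evolution rides along with a move rather than replacing it.
                options.append(("move", move, target, True))
    for mon in bench:
        options.append(("switch", mon, None, False))
    return options

def joint_actions(slot_a, slot_b, cap=400):
    """Pair the two slots' options, dropping pairs that cannot both happen."""
    joint = []
    for a, b in product(slot_a, slot_b):
        if a[3] and b[3]:
            continue  # one Mega per game, so both slots cannot Mega together
        if a[0] == b[0] == "switch" and a[1] == b[1]:
            continue  # both slots cannot switch to the same benched mon
        joint.append((a, b))
    return joint[:cap]  # assumed cap value

moves = ["move1", "move2", "move3", "move4"]    # placeholder move names
targets = ["foe_left", "foe_right", "partner"]  # simplification: every move is targeted
bench = ["bench1", "bench2"]

slot_a = slot_options(moves, targets, can_mega=True, bench=bench)
slot_b = slot_options(moves, targets, can_mega=False, bench=bench)
joint = joint_actions(slot_a, slot_b)
print(len(slot_a), len(slot_b), len(joint))  # 26 14 362
```

Even with these toy inputs, 26 options for one slot and 14 for the other already produce 362 legal pairs, which is why the width of a turn dwarfs the length of a game.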
Why this is genuinely hard for a computer
Chess and Go fell to computers because, for all their depth, they are polite games: one player moves, the board is fully known, nothing is random. This game breaks all three assumptions at once. Four reasons, in rising order of pain:
1. Simultaneous moves kill classical search. Chess engines are built on minimax — "I move, you pick your best reply, I pick my best reply to that." That alternation is the load-bearing assumption. Here, both players commit at once, so every turn is a mind game: do I attack, or do they expect the attack and click Protect (a move that blocks everything aimed at that Pokémon this turn)? Do I Fake Out their fast attacker (a priority move that flinches its target — but only works the turn a mon enters), or do they switch it out and my Fake Out hits air? There is no "best move" in isolation, only a best move against a distribution of what the opponent might do. Rock-paper-scissors with 400 hands.
2. The game rolls dice constantly. Every attack's damage is drawn from 16 possible damage rolls. Moves have secondary effects that trigger by chance — a burn here, a flinch there. Critical hits happen. The same plan executed twice can produce two different games. In data terms: the environment is stochastic, so a single outcome tells you almost nothing; only distributions of outcomes mean anything. Remember that when we get to measurement in Chapter 8.
3. The action space is combinatorial. You're not choosing one move; you're choosing a coordinated pair, every turn, from a menu that can reach ~400 entries — plus that 90-way draft before turn 1. And choices interact: the best action for slot a depends on what slot b does (focus both attacks on one target and it faints; split them and both survive).
4. Payoffs are delayed. Some of the strongest plays in the format do nothing immediately. Trick Room spends your whole turn to reverse the speed order for the next four turns — slow Pokémon move first — which is the entire game plan of some teams. Tailwind spends a turn to double your side's speed for four turns. To a naive learner, these moves look like wasted turns: you paid now, and the payoff arrives two or three decisions later, laundered through everything else that happened. Connecting a turn-1 setup move to a turn-9 win is the credit assignment problem, and it haunts this project all the way to Chapter 10.
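The credit assignment problem in point 4 can be made concrete with a generic reinforcement-learning calculation. The sketch below is not taken from the project: it assumes a 20-decision episode whose only nonzero reward is the final +1, and uses the textbook discounted-return recursion with an arbitrary discount factor to show how little signal reaches the early decisions.

```python
def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Assumed episode shape: 19 silent decisions, then a win worth +1.
episode = [0.0] * 19 + [1.0]

undiscounted = discounted_returns(episode, gamma=1.0)
discounted = discounted_returns(episode, gamma=0.9)  # 0.9 is an arbitrary choice

print(undiscounted[0], undiscounted[-1])  # every decision gets the same credit
print(round(discounted[0], 3), discounted[-1])  # the first decision gets far less
```

With no discounting, the turn-1 Trick Room and a meaningless late click receive identical credit; with discounting, the setup move receives the least. Neither tells the learner which decision actually mattered.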
One mercy, though. Open Team Sheets means both full teams are on the table from the start — species, items, moves, spreads. Unlike classic Pokémon, there is no "what's their fourth mon holding?" detective work. The game is near full-information: the only things you don't know are the opponent's simultaneous choice this turn, and what the dice will do. That's still plenty, but it spares the system an entire layer of hidden-information machinery.
Simultaneous moves, dice, a combinatorial joint action space, and delayed payoffs — but no hidden teams. That exact profile is why this project reaches for learning instead of search, and it explains almost every design decision in the chapters ahead.
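The dice in that profile are easy to picture in code. The sketch below assumes the mainline-series convention that the 16 rolls are the integer percentages 85 through 100 applied to the maximum damage and rounded down; it leaves out every other modifier, and both helper names are hypothetical.

```python
def damage_rolls(max_damage):
    """Return the 16 possible damage values, assuming 85%..100% in 1% steps."""
    return [max_damage * pct // 100 for pct in range(85, 101)]

def ko_chance(rolls, target_hp):
    """Fraction of equally likely rolls that deal at least target_hp damage."""
    return sum(roll >= target_hp for roll in rolls) / len(rolls)

rolls = damage_rolls(100)  # illustrative maximum damage, not real game data
print(len(rolls), min(rolls), max(rolls))  # 16 85 100
print(ko_chance(rolls, 93))  # 0.5
```

An attack that knocks out its target on exactly half of its rolls is the same plan producing two different games, which is why single outcomes carry so little information.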
Why not just write the rules down?
Faced with a hard game, the tempting first move is a hand-written strategy: compute the damage of every option, click the biggest number. The project tried exactly that, early on. It is the single most instructive failure in the repo.
A greedy "always click the highest-damage move" heuristic was benchmarked against an opponent that clicks uniformly at random. The heuristic won about 22% of games — dramatically worse than a coin flip, against an opponent with no strategy at all. The cause is structural, not a bug: max-damage never Protects, never Fake Outs, never sets speed control, never redirects, never focus-fires — which is to say it ignores most of what actually wins VGC doubles. The lesson stuck: nobody on this project could hand-write the rules of good play, so the rules would have to be learned. And "random" — the thing that beat our best hand-written idea — was enshrined as the baseline every learned policy must clear. (docs/caveats.md #1.)
It's the classic hand-tuned-rules-versus-learned-model story from data work. The heuristic is a hard-coded CASE WHEN ladder someone wrote from intuition; it encodes one signal (damage) and silently drops five others (tempo, protection, speed, positioning, redirection). A learned policy — the function that maps a game state to a choice of action — is the model you fit when the feature interactions are too gnarly to enumerate by hand. The 22% result is the A/B test that killed the rules engine.
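The structural blind spot is visible in a few lines. This is a minimal sketch of a greedy max-damage rule, not the project's benchmarked heuristic: the option list and damage numbers are invented for illustration.

```python
def max_damage_choice(options):
    """Pick the option with the highest estimated damage (the greedy rule)."""
    return max(options, key=lambda option: option["damage"])

# Hypothetical menu for one slot; damage values are made up.
options = [
    {"name": "Protect", "damage": 0},
    {"name": "Fake Out", "damage": 12},
    {"name": "Tailwind", "damage": 0},
    {"name": "Strong Attack", "damage": 70},
]

choice = max_damage_choice(options)
print(choice["name"])  # Strong Attack
```

Because Protect and Tailwind score zero on the only signal the rule looks at, they can never be selected while any damaging move is available, no matter how the game state changes.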
The goal: a recommender, not a trophy bot
Here is the part that reframes everything else in this course: the point of this project was never "a bot that wins games" as an end in itself. The deliverable is a recommender. Given your six and the opponent's six — both visible, thanks to Open Team Sheets — it should answer the questions a human player actually faces at the table:
- Which four do I bring? (the pick-4)
- Which two lead?
- What do I click on turn 1?
Plus one side product: logs of strong games — the system playing high-quality battles against itself, dumped as readable records to study (that's what self-play means: the same learner controls both sides, so it can generate unlimited games without a human opponent). The bot is the means; the recommendation is the product. It ships as src/benchmarks/recommend.py, and Chapter 11 walks through it — but every chapter between here and there is in service of making its answers trustworthy.
Which raises a question: recommendations against what? "The opponent" isn't one thing — the metagame clusters into recognizable game plans. The project tagged every team in its corpus (its library of 599 real tournament teams — Chapter 2 covers where they came from) with a primary archetype: a label for the team's win condition, derived automatically from its abilities, Mega choices, and move signatures. These archetypes are the axis the recommender answers along — "against this kind of team, bring these four."
- weather (mechanic). Sun, rain, sand, or snow teams: set the weather, then lean on Pokémon whose abilities and moves are supercharged under it. The most common plan in the corpus, matching the real metagame.
- trick_room (mechanic). Reverse the speed order for four turns so your slow bruisers move first. The canonical delayed-payoff plan, and the one a careless training pipeline later destroyed (Chapter 10).
- semi_room (mechanic). Trick Room plus redirection: one mon soaks attacks (Follow Me / Rage Powder) while the partner safely sets the Room. A hybrid, tagged with priority over plain trick_room.
- tailwind_ho (mechanic). Tailwind hyper-offense: double your side's speed for four turns and hit hard before the window closes. With weather, the other big cluster in the corpus.
- balance (mechanic). The fallback label: no single committed win condition, just good Pokémon played flexibly. What a team is tagged when no higher-priority signature fires.

The tagging is rule-based and priority-ordered: weather > semi_room > trick_room > tailwind_ho > balance, with secondary tags kept alongside the primary. The taxonomy itself is stored as a single JSON row (archetype_meta) in data/champions.db, and the tagger lives in src/tools/team_taxonomy.py + src/tools/tag_teams.py. It was derived from team data (abilities, Mega-granted abilities, move signatures), not hand-labeled per team.
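The priority order can be sketched as a first-match rule. This is an illustration only, not the code in src/tools/team_taxonomy.py: it assumes the signature checks have already run and produced a set of archetype names that fired, and `tag_team` is a hypothetical function name.

```python
# Priority order as stated in the text; "balance" is the fallback, not a signature.
PRIORITY = ["weather", "semi_room", "trick_room", "tailwind_ho"]

def tag_team(fired_signatures):
    """Return (primary, secondary_tags) for a set of fired archetype signatures."""
    fired = [name for name in PRIORITY if name in fired_signatures]
    if not fired:
        return "balance", []
    return fired[0], fired[1:]  # highest priority wins; the rest are kept as secondary

print(tag_team({"trick_room", "semi_room"}))  # ('semi_room', ['trick_room'])
print(tag_team({"tailwind_ho", "weather"}))   # ('weather', ['tailwind_ho'])
print(tag_team(set()))                        # ('balance', [])
```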
What one game looks like to the learner
Put the whole chapter in one picture: a game is a short sequence of decisions — one 90-way draft, then ~19 joint turn actions — with silence the whole way and a single verdict at the end.
Imagine tuning a 20-step data pipeline where the only observability is a green or red light after the final step — no per-stage metrics, no logs, and two of the stages are dice. That's the learning problem. And the "product requirement" isn't the green light itself: it's being able to tell a colleague which config to use for steps 1–3 against any given workload. Bot = pipeline that goes green; recommender = the config advice extracted from it.
Why can't a chess-style engine be pointed at this game directly?
Chess search (minimax) assumes alternating moves: I move, you see it, you respond. Reg M-B turns are simultaneous — both sides commit in secret and the choices resolve together — so "opponent's best reply to my move" isn't even well-defined. Add 16-roll damage variance and secondary effects, and the deterministic game tree that chess engines walk simply doesn't exist here.
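A toy payoff table makes the point about simultaneous commitment. The numbers below are invented, and the two-action menu is a drastic simplification; the sketch only shows that the best choice flips when the opponent's mix of actions changes, so it cannot be computed without a belief about that mix.

```python
def expected_payoffs(payoff, opponent_mix):
    """Expected payoff of each of my actions against a mix of opponent actions."""
    return {
        mine: sum(prob * row[theirs] for theirs, prob in opponent_mix.items())
        for mine, row in payoff.items()
    }

def best_response(payoff, opponent_mix):
    """My action with the highest expected payoff against that mix."""
    values = expected_payoffs(payoff, opponent_mix)
    return max(values, key=values.get)

# payoff[my_action][their_action], from my point of view; values are made up.
payoff = {
    "attack": {"attack": 0.5, "protect": -0.5},
    "set_up": {"attack": -1.0, "protect": 1.0},
}

aggressive = {"attack": 0.8, "protect": 0.2}  # assumed opponent tendencies
cautious = {"attack": 0.2, "protect": 0.8}

print(best_response(payoff, aggressive))  # attack
print(best_response(payoff, cautious))    # set_up
```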
The max-damage heuristic won ~22% against a random clicker. Why so much worse than a coin flip?
Because damage is only one of the signals that win doubles games, and hard-committing to it means never Protecting, never using Fake Out, never setting Trick Room or Tailwind, never redirecting, and never coordinating two attacks onto one target. A random opponent stumbles into those good plays occasionally; the heuristic never does. That's why the project abandoned hand-written rules and set random as the baseline for learned policies.
What does Open Team Sheets remove from the problem — and what uncertainty remains?
It removes hidden information about rosters: species, moves, items, and spreads for all twelve Pokémon are visible from team preview. What remains unknown is the opponent's simultaneous choice each turn and the outcome of chance (damage rolls, crits, secondary effects). Near full-information, but still a mind game played with dice.