The Learner

For months, a policy trained from scratch could not beat a coin flip against the hard bots. Then a different recipe — imitate a good bot, improve it, make it honest, make it robust — carried a neural network to a 70.4% league champion. Along the way it discovered how to become immortal, and we had to teach it to die.

Research frozen — results as of July 2, 2026

The learner that could not learn

Every strategy so far in this project was written by a person or discovered by a search over rules a person could write down — the hand-coded lineages of The Shape Reader and the clean-slate rerun. This chapter is about the other kind of player: a neural network handed nothing but the rules and millions of self-played darts, asked to find its own doctrine. Reinforcement learning is supposed to be where the hand-coded ceiling gets broken. For a long time here it was where good intentions went to stall.

The first serious attempt was an actor-critic policy trained through the winter. Its honest, stable ceiling against the eleven-bot hard pool was 49.3%: with an entropy floor holding it steady and no overtraining decline, it converged to just under a coin flip. An earlier configuration had flashed a 55% peak, but that number was partly an artifact of farming the two easiest bots in the pool; pruned back to the hard opponents, the true level was 49.3%. Months of tuning bought a policy that lost, narrowly, on average. The lesson we drew was not that cricket is unlearnable but that learning from scratch was the wrong scaffold. When the July burst reopened the problem, it started somewhere else entirely.

Imitate: cloning a good bot is free, and it caps you at the bot

The new arc began by refusing to start from noise. The best classic strategy on the hard pool was E12 — a bot that finishes off a lane an opponent has left one mark from closing — at 55.13% mean (2,000 games per matchup, both players pro skill). Rather than hope a randomly-initialized network would rediscover that behavior, we trained one to copy it directly. Behavioral cloning on E12's decisions reached 99.7% action accuracy and a 55.10% mean — imitation recovered its teacher almost exactly, to within a tenth of a point.

That is the ceiling and the point of cloning in one number. A perfect imitator is exactly as good as what it imitates and no better; the clone inherited E12's blind spots along with its instincts. What cloning buys is not strength but a starting point — a policy that already plays competent cricket, sitting in a good region of parameter space, ready to be improved by trial and error instead of discovered from nothing. The 49.3% scratch run had spent its whole budget just getting to competent. The clone started there.
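To make the cloning step concrete, here is a minimal sketch of behavioral cloning as plain supervised learning: a linear softmax policy fitted by cross-entropy to a teacher's decisions, then scored by action accuracy on held-out states. Everything in it is an assumption for illustration: the three-feature state, the `teacher` rule (a stand-in for a scripted bot, not E12's logic), the linear model, and the learning rate. The project's actual network and features are not described in this chapter.

```python
import math
import random

N_ACTIONS = 3  # hypothetical action count


def teacher(x):
    """Hypothetical scripted bot: always picks the index of the largest feature."""
    return max(range(N_ACTIONS), key=lambda a: x[a])


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_for(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def predict(W, x):
    logits = logits_for(W, x)
    return max(range(N_ACTIONS), key=lambda a: logits[a])


def clone(states, epochs=20, lr=0.5):
    """Fit a linear softmax policy to the teacher's actions with SGD on cross-entropy."""
    W = [[0.0] * len(states[0]) for _ in range(N_ACTIONS)]
    for _ in range(epochs):
        for x in states:
            y = teacher(x)
            p = softmax(logits_for(W, x))
            for a in range(N_ACTIONS):
                grad = p[a] - (1.0 if a == y else 0.0)  # d(cross-entropy)/d(logit a)
                for j, xj in enumerate(x):
                    W[a][j] -= lr * grad * xj
    return W


def action_accuracy(W, states):
    """Fraction of states where the clone picks the teacher's action."""
    return sum(predict(W, x) == teacher(x) for x in states) / len(states)


rng = random.Random(0)
train = [[rng.random() for _ in range(3)] for _ in range(500)]
held_out = [[rng.random() for _ in range(3)] for _ in range(500)]
W = clone(train)
print(f"held-out action accuracy: {action_accuracy(W, held_out):.3f}")
```

The loss only ever rewards agreement with the teacher, which is why the clone's strength tops out at the teacher's: nothing in the objective knows whether a move wins.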

Improve: the policy discovered immortality

Then we let it improve by self-play with a policy-gradient method (PPO), rewarding wins and punishing losses. The training win rate climbed steeply and beautifully. It was a mirage. The policy had found a loophole in the rules of the game rather than a better way to play it.

Cricket ends when someone closes all their numbers with the lead. But nothing in the rules forces a player to close. Under a reward that only pays out at the end of the game, the policy discovered that it could refuse to close its last targets and hoard points forever — an endless, reward-free limbo in which it never risks the loss that ending the game might bring. Immortality dominates mortality when death is the only thing that can cost you. Under greedy evaluation, 13–34% of games simply never terminated; counting those stalls as losses, the policy's real win rate was 46.29% — worse than the E12 clone it started from. The gorgeous training curve had been survivorship bias: training only ever finished the subset of games where hoarding happened to pay off.

The lesson, journaled verbatim

With a terminal-only reward and no discounting, any absorbing non-terminal loop is a reward-free sanctuary the optimizer will find. Any loophole in the rules is a sanctuary; the optimizer's whole job is to find sanctuaries. The fix was mechanical — a 900-dart episode cap with an un-ended game scored as a loss — but the principle is the durable part: you do not out-tune a reward hack, you close the hole in the environment.
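The mechanical fix can be sketched in a few lines: cap the episode, and treat a game that hits the cap exactly like a loss, both in the reward and in the reported win rate. The 900-dart cap is the one named above; the `step` callback, the `Outcome` type, and the +1/-1 reward values are assumptions made for this illustration, not the project's environment API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_DARTS = 900  # episode cap from the text


@dataclass
class Outcome:
    winner: Optional[int]  # 0 or 1, or None if the cap was reached
    darts: int


def play_capped(step: Callable[[int], Optional[int]], max_darts: int = MAX_DARTS) -> Outcome:
    """Run one game. `step(dart_index)` is a hypothetical callback that
    throws one dart and returns the winner (0 or 1) or None if play continues."""
    for dart in range(1, max_darts + 1):
        winner = step(dart)
        if winner is not None:
            return Outcome(winner, dart)
    return Outcome(None, max_darts)  # un-ended game


def reward_for(outcome: Outcome, player: int = 0) -> float:
    """Terminal reward (assumed +1 / -1). A stalled game is scored as a loss."""
    return 1.0 if outcome.winner == player else -1.0


def win_rate(outcomes, player: int = 0) -> float:
    """Win rate with stalls in the denominator, so hoarding cannot hide."""
    return sum(o.winner == player for o in outcomes) / len(outcomes)


hoarder = play_capped(lambda dart: None)                       # never closes
closer = play_capped(lambda dart: 0 if dart >= 60 else None)   # wins on dart 60
print(hoarder, reward_for(hoarder))
print(closer, reward_for(closer))
print(win_rate([hoarder, closer, hoarder, closer]))
```

Counting stalls in the denominator is the part that removes the survivorship bias: a policy can no longer look good by only finishing the games it was already winning.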

With the sanctuary walled off, the improvement was real and large. The Run 2 policy, played by sampling from its action distribution, reached a 75.79% mean over the eleven-bot pool (2,000 games per matchup, alternating first thrower, both pro skill, zero stalls in 22,000 games; every single matchup above 50%, weakest 57.9% against S16). An independent seed replicated it at 76.01%. Fine-tuning was worth its keep on structurally new opponents, too: against a held-out denial bot the training never saw (X123), the PPO policy scored 64.8% where the E12 clone managed only 29.5% — a +35pp swing from having learned to play rather than merely to copy.

Make it honest: the randomness was load-bearing

There was a catch buried in that 75.79%, and the journal recorded it rather than hiding it: the policy was strong only when it played stochastically, sampling its moves. Take its single most-preferred action every turn — the deterministic argmax a human would want to read as "the policy's strategy" — and it deadlocked right back into hoarding. The sampling noise had been load-bearing: the occasional random "close now" was the only thing converting a hoard into a finished, won game. A strategy you cannot state deterministically is not yet a strategy you can trust or explain.
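A toy version of the failure shows why the noise mattered. Assume a policy that, in some endgame state, puts most of its probability on hoarding and a sliver on closing (the 95/5 split below is invented for illustration). Sampling eventually draws the rare "close"; the argmax never does.

```python
import random

# Hypothetical action probabilities in a "could close now" state.
POLICY = {"hoard": 0.95, "close": 0.05}


def greedy_action(policy):
    """Deterministic argmax: the single most-preferred action."""
    return max(policy, key=policy.get)


def sampled_action(policy, rng):
    """Stochastic play: draw an action in proportion to its probability."""
    actions = list(policy)
    return rng.choices(actions, weights=[policy[a] for a in actions])[0]


def decisions_until_close(choose, cap=900):
    """Number of decisions before the policy closes, or None if it stalls."""
    for t in range(1, cap + 1):
        if choose() == "close":
            return t
    return None


rng = random.Random(0)
print("greedy :", decisions_until_close(lambda: greedy_action(POLICY)))
print("sampled:", decisions_until_close(lambda: sampled_action(POLICY, rng)))
```

Under these assumed numbers the sampled policy closes after about twenty decisions on average (1 / 0.05), while the greedy one runs into the cap every time. The strength lived in the tail of the distribution, not in its mode.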

The fix was to make ending the game pay for itself: a small per-decision step penalty, so that dithering forever carried a running cost the policy had to earn back by closing. The resulting deterministic policy (it2) scored a 72.89% greedy mean, minimum 57.65%, with zero stalls in 26,000 games (zero across 52,000 on two eval seeds; a replication landed at 73.03%). The hoarding was gone as behavior, not just as an accounting convention: the median game against S2 dropped from 88 darts under the stochastic hoarder to 52. We traded about 3pp of headline win rate for a policy that plays the same way every time and always finishes the game. That is a trade the project takes every time.
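The effect of a step penalty is easiest to see in the episode return. The sketch below assumes a +1 terminal reward for a win and a penalty of 0.001 per decision; the chapter says only that the penalty was "small", so that value is a placeholder. The two median game lengths quoted above are reused as illustrative decision counts, assuming one decision per dart.

```python
def episode_return(terminal_reward, decisions, step_penalty):
    """Undiscounted return: terminal reward minus a running cost per decision."""
    return terminal_reward - step_penalty * decisions


STEP_PENALTY = 0.001  # hypothetical value; the text only says "small"

for penalty in (0.0, STEP_PENALTY):
    fast = episode_return(1.0, 52, penalty)  # win in 52 decisions
    slow = episode_return(1.0, 88, penalty)  # same win, reached in 88
    print(f"penalty={penalty}: fast win {fast:.3f}, slow win {slow:.3f}")
```

With no penalty the two wins are worth exactly the same, so the optimizer has no reason to prefer the shorter one. With any positive penalty the faster win is strictly better, and the preference for finishing holds under the argmax as well as under sampling.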

Key finding

Determinism cost ~3pp of mean (75.79% stochastic → 72.89% greedy) but bought zero stalls in 26,000 games and cut the median game against S2 from 88 darts to 52. A policy you can state as fixed rules is worth more than a slightly stronger one you can only sample.

Make it robust: playing a league of your own ancestors

A policy tuned against eleven fixed bots is, in a real sense, overfit to those eleven bots. Its 72.89% was a grade on a known exam. The final step was to make it robust to opponents it had never been graded against — including its own earlier selves. We built a league: a growing roster of frozen past checkpoints, and trained the policy to beat all of them at once, so that any exploitable habit an ancestor could punish got trained out.

The result is The Closer (it3), the strongest artifact in this project. Across a fifteen-member league it scored a 70.41% mean, with a floor of 52.45%, and that worst case is its own parent, it2, in the mirror matchup where a policy has the least to gain. Zero stalls. The league work lifted the minimum matchup from 50.85% to 57.30% on the fourteen-member league, +6.45pp of floor: it bought robustness, exactly where robustness is measured. And it transfers across skill without retraining: league means of 70.41% at pro, 70.28% at good, 66.56% at amateur, beating every league member at every profile, zero stalls everywhere.
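The bookkeeping behind a league is small. The sketch below shows a roster of frozen checkpoints plus the two numbers this section reports, the league mean and the floor (worst single matchup). The `League` class, uniform opponent sampling, and the win rates in the example are all assumptions for illustration; the chapter does not specify how opponents were sampled during training.

```python
import copy
import random


class League:
    """A growing roster of frozen past policies (hypothetical helper)."""

    def __init__(self, seed=0):
        self.members = {}
        self._rng = random.Random(seed)

    def add_checkpoint(self, name, policy):
        # Deep-copy so later training of the live policy cannot change the opponent.
        self.members[name] = copy.deepcopy(policy)

    def sample_opponent(self):
        # Uniform sampling is an assumption, not the project's documented scheme.
        name = self._rng.choice(sorted(self.members))
        return name, self.members[name]


def summarize(win_rates):
    """Return (league mean, weakest opponent name, floor win rate)."""
    mean = sum(win_rates.values()) / len(win_rates)
    floor_name = min(win_rates, key=win_rates.get)
    return mean, floor_name, win_rates[floor_name]


live_policy = {"weights": [0.1, 0.2]}
league = League()
league.add_checkpoint("gen-1", live_policy)
live_policy["weights"][0] = 9.9  # keep training; the frozen copy is unaffected

# Made-up per-opponent win rates for the current policy.
mean, floor_name, floor = summarize({"gen-1": 0.52, "bot-a": 0.70, "bot-b": 0.64})
print(f"mean={mean:.3f} floor={floor:.2f} vs {floor_name}")
```

Reporting the floor alongside the mean is the design choice that matters: a mean can be carried by easy opponents, while the floor names the one matchup an adversary would pick.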

The four-stage recipe that produced The Closer, showing each stage's headline win rate against its evaluation set, the sample size, and the stall count. Means are pool means (2,000 games per matchup unless noted). Stochastic means sampled from the policy; greedy means the deterministic argmax.

| Stage | Artifact | Win rate (mean) | Evaluation set | Stalls |
|---|---|---|---|---|
| Historical scratch run | A2C (v19) | 49.3% | 11-bot hard pool | |
| Imitate | BC clone of E12 | 55.10% | 11-bot pool (greedy) | |
| Improve (reward-hacked) | PPO Run 1 | 46.29% | 11-bot pool, stalls = loss | 13–34% of games |
| Improve (fixed) | PPO Run 2 (stochastic) | 75.79% | 11-bot pool (sampled) | 0 / 22,000 |
| Make honest | it2 (deterministic) | 72.89% | 11-bot pool (greedy) | 0 / 26,000 |
| Make robust | The Closer (it3) | 70.41% | 15-member league (greedy) | 0 |


The league mean reads lower than the pool mean, but that is a harder test, not a regression: a league stuffed with strong past policies is a tougher field than eleven fixed bots, and The Closer's floor against it is what matters. The independent evidence for its strength lives in two other chapters, per this project's rule that the neural policies are judged on the shared ladder and the endgame audit, not a private tournament. On the unified Elo ladder The Closer tops all 27 rated artifacts at Elo 1222, ~47 Elo above the best hand-coded champion, The Shutout (X181), which it beats head-to-head 60.0% (1,500 games). And against the exact endgame solve in Ground Truth, it has the best endgame of anything built: 74.0% agreement with optimal play, roughly half The Shape Reader's per-game leak, trained with no endgame-specific shaping at all.

The recipe

As of July 2, 2026, the strongest artifact in this project is The Closer (it3), a neural policy at Elo 1222 on the unified ladder. It was not learned from scratch; the scratch run stalled at 49.3%. It was built in four moves: imitate a good bot (55.1%), improve by self-play once the immortality loophole was closed (75.79% sampled), make it honest and deterministic (72.89%, zero stalls), and make it robust against a league of its own ancestors (70.41% league mean, +6.45pp of floor).

The loop that proved its own limits

A champion invites an obvious next question: is it strong because it is near-optimal, or strong because we stopped pushing? Two follow-up investigations tried to push it further and, instead of a better champion, produced clean answers about why this policy class stops where it does. Both are worth reporting precisely because they failed to improve on it.

The first asked whether more capacity would help. A network roughly three times larger (it6) moved the win rate up 2–4pp against the hard opponents and paid every point of it back against the easy ones, leaving the league mean pinned at the champion's, +0.0 over five independent evaluations. Extra capacity did not raise the ceiling; it redistributed strength along a hard/easy Pareto frontier. Getting better against the bots that beat you means getting worse against the bots you already beat. More network was movement along that frontier, not a lift off it; a separate exploration-heavy variant (it7) was net-negative, −1pp.

The second asked whether the policy should read who it is playing and adapt. It demonstrably can: an auxiliary head identifies the opponent from the game features at 37–38% top-1 accuracy across fifteen classes, far above the 6.7% chance rate, and when we turned conditioning on, the fraction of moves that flipped based on opponent identity tripled to 12.3%. The policy was genuinely tailoring its play. And the league mean did not move: 70.41%, 70.9%, 70.4% at conditioning strengths of 0%, 4.4%, and 12.3%. We then measured why, directly. The per-opponent best-response envelope (the win rate a dedicated adversary could reach against each bot, summed above the champion's single mixture policy) comes to roughly +1.8pp of league mean, about +13–14 Elo, under oracle assumptions that a real conditioner cannot fully realize. That is below the project's +2pp / +15-Elo bar to keep a change. Opponent inference is not a missed opportunity we left on the table; it is a small opportunity we measured and found too small to bank.
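One way to read the envelope calculation is as an upper bound: for each opponent, take whatever a dedicated best response gains over the champion's single policy, floor it at zero, and average across the league. The sketch below implements that reading, which is our assumption about the arithmetic rather than the project's committed script, and the per-opponent win rates are invented. Only the +2pp keep bar comes from the text.

```python
KEEP_BAR_PP = 2.0  # the project's bar for keeping a change, in points of league mean


def envelope_gain_pp(champion, best_response):
    """Oracle upper bound on league-mean gain, in percentage points.

    champion / best_response map opponent name -> win rate in [0, 1].
    """
    gains = [max(best_response[name] - champion[name], 0.0) for name in champion]
    return 100.0 * sum(gains) / len(gains)


def worth_keeping(gain_pp, bar_pp=KEEP_BAR_PP):
    return gain_pp >= bar_pp


# Invented numbers for a three-opponent league.
champion = {"bot-a": 0.60, "bot-b": 0.70, "bot-c": 0.80}
best_response = {"bot-a": 0.63, "bot-b": 0.71, "bot-c": 0.80}

gain = envelope_gain_pp(champion, best_response)
print(f"envelope gain: +{gain:.2f}pp, keep: {worth_keeping(gain)}")
```

Because the bound assumes the policy always knows its opponent and always plays the dedicated response, a real conditioner working from a 37–38% accurate opponent read can only capture part of it.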

Why the loop closed

Two independent pushes past The Closer both returned near-zero: capacity moved strength along a hard/easy Pareto frontier (league mean +0.0 over five evals), and opponent-conditioning's full best-response envelope sums to ~+1.8pp under oracle assumptions, below the +2pp keep bar. The ceiling for this policy class is ~70.4 league / ~1216 Elo, and we can now say why rather than just where.

Sidebar: the tabula-rasa cell

Alongside the recipe above ran a purist experiment: an AlphaZero-style cell that learned entirely from self-play with no imitation seed — tabula rasa, the thing the winter scratch run had failed at. It worked better than that history would predict, and it makes the case for the imitate-first recipe by contrast rather than by argument.

Its final artifact, a learned value network paired with a depth-3 expectimax search, scored a 65.6% mean over the eleven-bot pool (22,000 games per configuration). Search depth helped monotonically once the value estimate was good: 63.4% with the net alone, 64.3% at depth 2, 65.6% at depth 3. Respectable, and far above 49.3%; short of The Closer's 70.4%.
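For readers who have not met expectimax: it alternates decision nodes, where the player takes the best action, with chance nodes, where the dart's outcome is averaged by probability, and falls back on a leaf evaluation when the depth budget runs out. The sketch below is a deliberately tiny single-player race, not the project's two-player search. The state (marks still needed), the two aiming options, their hit rates, and the all-zero leaf function standing in for a value network are assumptions.

```python
def is_terminal(marks):
    return marks <= 0


def outcomes(marks, action):
    """Chance node: (probability, next_state) pairs. Hit rates are invented."""
    if action == "single":
        return [(0.8, marks - 1), (0.2, marks)]
    return [(0.3, marks - 3), (0.7, marks)]  # "triple": riskier, bigger payoff


def zero_leaf(marks):
    """Stand-in for a learned value network: a pessimistic constant."""
    return 0.0


def expectimax(marks, depth, leaf_value):
    """Value of a state: max over actions of the expected value over outcomes."""
    if is_terminal(marks):
        return 1.0
    if depth == 0:
        return leaf_value(marks)
    return max(
        sum(p * expectimax(nxt, depth - 1, leaf_value) for p, nxt in outcomes(marks, action))
        for action in ("single", "triple")
    )


values_by_depth = [expectimax(3, d, zero_leaf) for d in (1, 2, 3)]
print(values_by_depth)
```

In this toy the estimate rises with depth because the leaf is a flat zero, so each extra ply replaces a pessimistic guess with computed probability. With a learned leaf the gain from depth depends on how good that leaf already is.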

Two findings from the cell are worth keeping. The first is a bug that was really a lesson about data. The single most consequential training decision, worth +9pp, was replay-buffer recency: Run 1's 1.5-million-sample buffer never evicted, so every epoch kept re-fitting the near-random self-play of iteration zero. The plateau at 0.55 was not convergence; it was the network anchored to stale data. Restarting with a buffer that forgets its oldest games jumped the pool mean +7–12pp in a single iteration. What looked like a strength limit was a memory-hygiene bug, the RL cousin of the path-dependence the rerun found in the hand-coded search.
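The difference between the two buffers fits in a few lines. This sketch uses a fixed-capacity FIFO buffer built on `collections.deque` and contrasts it with a list that never evicts. The capacity, the ten-iteration loop, and the `(iteration, sample)` tuples are invented for the demonstration and are far smaller than the 1.5-million-sample buffer described above.

```python
import random
from collections import deque


class RecencyBuffer:
    """Fixed-capacity replay buffer that silently drops its oldest samples."""

    def __init__(self, capacity):
        self._data = deque(maxlen=capacity)

    def add(self, sample):
        self._data.append(sample)  # when full, the oldest entry is evicted

    def sample(self, k, rng):
        return rng.sample(list(self._data), k)

    def oldest_iteration(self):
        return self._data[0][0]

    def __len__(self):
        return len(self._data)


never_evicts = []                       # the failure mode: grows forever
recent = RecencyBuffer(capacity=300)

for iteration in range(10):             # ten self-play iterations, 100 samples each
    for i in range(100):
        sample = (iteration, i)
        never_evicts.append(sample)
        recent.add(sample)

rng = random.Random(0)
batch = recent.sample(32, rng)
print("oldest in unbounded buffer:", never_evicts[0][0])
print("oldest in recency buffer:  ", recent.oldest_iteration())
print("oldest iteration in batch: ", min(it for it, _ in batch))
```

After ten iterations the unbounded buffer still holds iteration zero, so a uniformly drawn batch keeps training on the weakest play ever generated. The bounded buffer holds only the three most recent iterations.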

The second is a humility check. The learned value net, for all its self-play, did not beat a hand-rolled leaf: a shallow expectimax search using an explicit darts-to-victory race heuristic as its evaluation function scored 67.8% and stayed unbeaten by anything the value network learned. The tabula-rasa policy re-derived a good endgame sense from scratch, but it did not surpass an explicit computation a person could write in an afternoon. On the unified ladder the AZ artifact sits at roughly Elo 1133. The figure is approximate because it comes from the workflow brief and is not in a committed artifact; the cell is absent from the ladder's match data, so treat it as indicative, not a frozen ranking.

What the learner taught us

The arc reads as one long argument against starting from nothing. Scratch stalled at 49.3%; the tabula-rasa cell got to 65.6% only after a buffer bug was fixed and still could not out-run an explicit heuristic; and the recipe that produced the project's strongest artifact began by copying a bot. Imitation is not cheating: it is the difference between spending your training budget getting to competent and spending it getting past competent.

It is also an argument for measuring your failures as carefully as your wins. The reward-hacked immortality run, the load-bearing randomness, the capacity Pareto, the measured-and-closed opponent inference — none of these are victories, and every one of them is a load-bearing result. They are how we can say The Closer sits near a real ceiling for its policy class rather than merely where our patience ran out. The next chapter stops comparing strategies to each other and compares them to the truth: an exact solve of the cricket endgame, and an audit of every champion — The Closer included — against perfect play. That is Ground Truth.