Ground Truth

Every champion in this project was rated against other champions. Here we rate them against perfection: an exact solve of the darts-cricket endgame, and an audit of how far each strategy strays from it.

Research frozen — results as of July 2, 2026

A ladder ranks players against each other. This ranks them against the truth.

Every strength claim on this site up to now has been relative. A win rate says one strategy beats another; an Elo rating places a strategy on a scale built entirely out of those head-to-heads. That is honest, but it has a ceiling: a field of players can all share the same blind spot, and no amount of playing them against each other will reveal it. To find the blind spots you need an outside reference — a move that is simply correct, independent of who is at the board.

For the endgame, that reference now exists. We solved the closing phase of darts cricket exactly: for every reachable position near the finish, we computed the true probability that the player to throw eventually wins, assuming both sides play perfectly from there on. The solved region covers 4.16 billion states — every combination of marks, darts remaining, and score differential inside the closing phase. With those numbers in hand, "what should you throw?" stops being a matter of doctrine and becomes a lookup. And "how good is this champion, really?" becomes a measurement: replay its real games, and every time it enters a solved position, compare what it threw to what perfection would have thrown.

Two findings anchor this chapter. First, the strongest artifact in the project — the neural policy we call The Closer (it3) — turns out to have the best endgame of anything we built, agreeing with perfect play 74.0% of the time despite never being trained on the endgame at all (71,961 scored decisions). Second, the old site's hero — The Shape Reader (X188) — has an endgame it cannot defend: it forfeits 5.2 win-probability points every game inside the solved slice (52,415 decisions), ranking 10th of 13 audited strategies. Its ladder standing, as of July 2, 2026, is carried by its opening and midgame, not its close.

The headline

The endgame is solved — 4.16 billion exact win probabilities . Scored against it, The Closer (it3) leaks 2.7 win-probability points per game ; The Shape Reader (X188) leaks 5.2 . The best endgame in the project belongs to the policy that was never taught one.

Solving 4.16 billion positions, then proving the solve

Cricket does not have a small endgame. But it has a tractable one. Once most of the board is closed, the number of things that can still happen collapses, and the game becomes a race with a scoreboard attached. We defined the closing phase as any position with openness ≤ 6 — at least 8 of the 14 lanes already closed — inside a score band of ±250, both players on the pro accuracy profile. Within those bounds the position space is finite and enumerable: 2,767,444 mark patterns × 3 darts remaining × 501 score differentials = 4.16 billion states , stored as 16.6 GB of 32-bit win probabilities. We solved it shell by shell, working inward from fully closed positions (openness 0) out to openness 6, each shell a self-contained, checkpointed tablebase.

An exact solve is only useful if it is actually exact, so the harder work was proving it. Four independent validation gates all passed: a mechanical-equivalence check against the game engine, analytic spot-checks on positions with known answers, a boundary re-solve of the smaller shells, and — the decisive one — a Monte-Carlo gate that took 20 solved states, played 100,000 games each from those states against the live simulator, and checked the observed win rate against the tablebase's predicted probability. The worst disagreement across all 20 was |z| = 2.63, with zero failures — exactly the spread you expect from sampling noise around a correct number, not a systematic error.

Validation also earned its keep by catching a real bug. Under small sequential seeds, the raw random-number generator biased the very first dart of a game; the fix was a splitmix64 hardening step at the wrapper boundary . It is worth saying plainly: the solve found a flaw in the tooling that prior experiments had quietly run on top of. That is what a ground-truth reference is for.

What "solved" means here

For any position in the closing phase (openness ≤ 6, score within ±250), the tablebase returns the exact probability the player to throw wins under perfect play by both sides — validated to within sampling noise by 20 states × 100,000 games, worst |z| = 2.63, zero fails . It is a lookup table for correctness, not a heuristic.

The audit: every champion, scored against perfection

With an exact reference in hand, we can stop asking whether a strategy beats its rivals and start asking whether it plays the endgame correctly. The method is deliberately unglamorous. Thirteen strategies — the neural policy, the rerun champions, the main-lineage bots, and two baselines — each played 6,000 games (2,000 against each of three opponents), and every decision that landed in a solved position was scored against the tablebase's exact win probabilities . That is 52,000 to 75,000 scored decisions per strategy. Each decision earns an agreement flag (was the chosen throw one of the exactly-optimal ones, tie-aware) and an EV loss in win-probability points (how much win probability the choice gave away versus the best available throw).

The cleanest single number is EV lost per game: sum a strategy's in-slice losses within each game, average over all 6,000 games. It answers a concrete question — if we bolted a perfect endgame onto this strategy and left the rest untouched, how much win probability would we stop hemorrhaging? Ranked that way, the order is a verdict, and it does not match the ladder.

The Closer (it3) wins the endgame outright. It is best on every metric: 74.0% agreement, an EV leak of just 2.7 win-probability points per game (95% CI 2.6–2.8), and a blunder rate of 0.38% — one catastrophic error in roughly 260 in-slice decisions . Every hand-built rerun champion — X123, The Shutout (X181), X129, X168 — beats The Shape Reader here. And the two classic baselines that anchor the bottom, S2 and E12, leak roughly four times what The Closer does. The pattern is that the neural lineage's edge is not only midgame cleverness; a large share of it is endgame skill it was never explicitly taught.

Agreement with exact optimal endgame play, all 13 audited strategies ranked high to low. The Closer (it3) leads at 74.0%; no strategy plays the endgame perfectly. Agreement measures match rate, not EV loss — the table below ranks by win-probability lost per game.
Endgame audit: 13 strategies ranked by win-probability lost per game against the exact tablebase, best (lowest leak) first. Each strategy played 6,000 games; 52,000–75,000 decisions scored per strategy. Columns: agreement with optimal play, EV lost per game with 95% bootstrap confidence interval, and blunder rate. Color in the agreement column encodes agreement magnitude.
Rank Strategy Lineage Agreement EV lost/game (95% CI) Blunder rate
1 The Closer (it3) RL (PPO) 74.0% 0.027 (0.026–0.028) 0.38%
2 X123 clean-slate 58.0% 0.032 (0.030–0.034) 1.15%
3 H1_Hoarder clean-slate 39.0% 0.037 (0.034–0.040) 2.07%
4 The Shutout (X181) clean-slate 67.7% 0.040 (0.038–0.042) 1.17%
5 X129 clean-slate 57.2% 0.043 (0.041–0.045) 1.67%
6 X168 clean-slate 62.9% 0.043 (0.041–0.045) 1.09%
7 PS baseline 51.9% 0.048 (0.044–0.051) 2.48%
8 X147 clean-slate 63.5% 0.050 (0.047–0.052) 1.69%
9 X159 clean-slate 63.9% 0.051 (0.048–0.053) 1.85%
10 The Shape Reader (X188) main 61.6% 0.052 (0.049–0.055) 3.42%
11 X109 main 57.3% 0.058 (0.056–0.061) 4.04%
12 S2 baseline 43.8% 0.099 (0.094–0.104) 4.63%
13 E12 main 43.2% 0.104 (0.099–0.109) 4.68%

Shading: agreement 43% → 74%.

One subtlety keeps the table honest: agreement and EV loss measure different things. H1_Hoarder agrees with the tablebase only 39.0% of the time, yet leaks less per game than The Shutout at 67.7% agreement . The reason is that most of H1's disagreements are near-ties — moves that are technically not optimal but cost almost nothing — while other strategies' totals are dominated by rarer, larger errors. Match rate tells you how often a strategy is wrong; EV loss tells you how much being wrong costs. Rank by cost.

A lead makes everyone sloppy

The most robust behavioral pattern in the whole audit is also the most counterintuitive: strategies play their best endgame when they are losing, and their worst when they are winning. Break each strategy's agreement down by score position and the shape is universal. Behind by more than ten, agreement runs high — up to 90.2% for The Closer . Ahead by more than ten, every strategy collapses into the 52–62% band , no matter how strong it is overall.

The mechanism is clean, and it explains why this survived so many rounds of tuning undetected. When you are behind, the intuitive move and the correct move coincide: you chase, you close, you claw back — and the tablebase agrees, because with the deficit doing the work, racing to close is genuinely optimal. When you are ahead, intuition and truth diverge. The champions switch into close-out mode and race to shut their own remaining lanes. Perfect play does the opposite: it keeps scoring and denying on the opponent's open lanes, widening the margin first, and only closes when closing actually wins. Every hand-built strategy learned the losing player's endgame well and the winning player's endgame badly — because in self-play against similar opponents, the winning-side states are the ones your own chase heuristic rarely stress-tests.

The lead trap (n = 52,000–75,000 decisions/strategy)

Agreement with perfect play when behind: up to 90.2% (The Closer). Agreement when ahead: 52–62% for every strategy in the field . A lead is the exact situation in which champions stop playing well — they close out when they should keep scoring on the opponent's lanes.

Agreement with optimal play by score position, for four representative strategies. Columns show agreement when behind by more than ten, level (within ten), and ahead by more than ten. Drawn from the endgame audit, 52,000–75,000 scored decisions per strategy. Color encodes agreement magnitude.
Strategy Behind (<−10) Level (±10) Ahead (>+10)
The Closer (it3) 90.2% 83.5% 61.7%
The Shutout (X181) 84.7% 75.1% 58.6%
X168 85.2% 73.1% 51.4%
The Shape Reader (X188) 66.5% 61.2% 57.3%

Shading: agreement 51% → 90%.

The Shape Reader is the revealing exception. Its agreement barely moves across score position — 66.5% behind, 61.2% level, 57.3% ahead — because its endgame is roughly equally mediocre everywhere, rather than good-when-behind and bad-when-ahead. It does not have a lead problem so much as a general endgame problem, which the next section makes concrete.

Ties go to the closer — and almost nobody knows it

Buried in the rules is a fact that turns out to decide games: at an even score, the player who closes the final lane wins. A tie is not a tie — it is a won position for whoever finishes first. Perfect play exploits this ruthlessly. When you are level with one lane one mark from done, a plain single closes that lane and wins the game on the spot. There is nothing more valuable on the board.

Most champions do not know this. The single largest error in all of recorded play — across every strategy, every game, every scored decision — is a strategy standing on an even score, one dart from winning, and throwing at the bull to pad a score that is already good enough. The tablebase catalogued it exactly. X109 forfeits 0.563 win probability in one throw: level score, all lanes closed except its 16 (needing one mark) and the opponent's bull, one dart in hand. It played a double-bull (win probability 0.403). The optimal throw was a single 16 (win probability 0.966) — any hit on 16 closes the last lane and wins immediately, because the closer takes the tie . It threw at the bull to pad a lead it did not need and handed the opponent a full turn, one mark from closing.

This is not a lone bug. It is a doctrine error, and The Shape Reader has the identical blind spot. Its third-worst catalogued blunder is the same shape: forfeit 0.542 at an even score with its 15 one mark from done, one dart in hand — it played double-bull; the optimal throw was a single 15 that wins on contact . X188's win-finish routine was inherited early and never corrected: it aims a double at the bull where a plain single at its own last lane would win the game. Over all 260 catalogued exemplars, this family — closing (or padding) into a position that does not win, when a different throw wins outright — accounts for 188 of 260 ; scoring when it should close accounts for 55, and closing the wrong lane for 17.

There is a piece of history worth pinning to this exact error. The tie-goes-to-the-closer rule — the "win-tap" — was not something the main lineage ever fully learned in 115 experiments. It was rediscovered from the rules alone, at iteration 3 of the independent rerun, and kept immediately . The tablebase then certified, decisively, that its absence is The Shape Reader's single most expensive blind spot. The rerun found the law; the solve proved why it matters.

The worst throw ever recorded (certified)

At an even score, one dart from closing the last lane, X109 threw at the bull: win probability 0.403. The optimal single would have closed the lane and won outright: 0.966. A 0.563 win-probability forfeit on a single dart — the largest single-decision error in all of recorded play, and the same blind spot The Shape Reader carries (forfeit 0.542).

A gallery of catalogued blunders

The audit stored the worst single-decision forfeits as concrete positions — the exact marks, the score, the darts remaining, what the strategy threw, and what perfection would have thrown. Six of them, drawn straight from the catalog, show the whole taxonomy of endgame error in one place. In each: means a lane is closed, (n) means n marks so far on an open lane; the win-probability gap is what the chosen throw gave away.

1 · X109 — forfeit 0.563 (the worst throw recorded)

Me: all lanes closed except 16 (2 marks). Opponent: all closed except bull (1 mark). Even score, 1 dart. Played double-bull (0.403). Optimal: single 16 (0.966) — any hit closes the last lane and wins on the spot, because ties go to the closer. Padding an already-sufficient score instead handed the opponent a full turn.

2 · The Shape Reader (X188) — forfeit 0.542

Me: all closed except 15 (2 marks). Opponent: all closed except bull (1 mark). Even score, 1 dart. Played double-bull. Optimal: single 15 — closes and wins immediately. The identical tie-goes-to-the-closer blind spot that X109 has, in the main-lineage champion.

3 · PS — forfeit 0.550

Me: all closed except 20 (2 marks). Opponent: all closed except bull. Even score, 1 dart. Played double-bull. Optimal: single 20 — wins immediately. The same motif in a plain baseline bot; the error is not a quirk of one lineage but a gap in how nearly everyone treats the tie rule.

4 · S2 — forfeit 0.502

Me: all closed except bull (2 marks). Opponent: all closed except 15 (2 marks). Even score, 1 dart. Played triple-15 — padding +45 on the opponent's open lane. Optimal: single bull — closes my last lane at an even score and wins. Scoring on the opponent when a single throw finishes the game.

5 · X168 — forfeit 0.443

Me: 19 (1 mark), rest closed. Opponent: bull (2 marks), rest closed. Ahead by 12, 1 dart. Played double-bull. Optimal: triple-19 — with the lead already sufficient, two more marks close the last lane and win outright on a double or triple, and a single still progresses. The double-bull merely pads a lead that was already big enough while leaving 19 open and the opponent one bull mark from racing back.

6 · H1_Hoarder — forfeit 0.495

Me: bull (2 marks), rest closed. Opponent: 20 (0 marks), rest closed. Behind by 1, 1 dart. Played double-bull. Optimal: triple-20 — closing the bull while behind does not win (the differential is still negative at close-out) and leaves nothing to do; triple-20 on the opponent's open lane flips the differential to +59 first. Fix the deficit before the close, not after.

Read together, the gallery is one lesson repeated: the score differential is a resource you fix before the final close-out, not after. Perfect play grinds the opponent's open lanes until the margin is safely positive and only then closes; the champions race to close first — usually the bull — and arrive at a close-out that does not win. That single reversal is the dominant leak in the entire field.

Bolting on a perfect endgame: how much would it buy?

The audit measures error; a behavioral test measures what fixing it is worth. We built a hybrid of each top champion — its own policy everywhere out of the closing phase, the tablebase's exact move inside it — and played that hybrid 50,000 games against the plain, unmodified champion (plain-versus-plain is 50% by symmetry; the 95% interval is ±0.44pp) . The gap is the pure value of a perfect endgame, with everything else held fixed.

The result lands exactly where the audit predicts, and it flips the ranking. The strategy with the most to gain from a perfect endgame is The Shape Reader: +4.4 percentage points against itself , from touching only about 7.5 darts per game (50,000 games). That single number reframes the whole chapter — X188's recoverable endgame headroom is roughly the size of the entire X123-versus-X188 championship margin measured on the ladder. The Closer, already near-perfect in the endgame, has the least to gain (+2.76pp): there is little left to fix.

Hybrid-champion behavioral test: each champion's win rate when it plays the exact tablebase move inside the closing phase and its own policy elsewhere, against the plain unmodified champion, 50,000 games each, both players pro skill, alternating starters, ±0.44pp 95% interval. Endgame gain is the improvement over the 50% plain-versus-plain baseline. Color encodes win rate.
Champion Hybrid win rate Endgame gain Tablebase darts/game
The Shape Reader (X188) 54.40% +4.40pp 7.5
The Shutout (X181) 53.54% +3.54pp 11.0
The Closer (it3) 52.76% +2.76pp 9.5
X123 51.40% +1.40pp 9.5

Shading: win rate 51% → 54%.

The hybrid gains preserve the lineage ordering for the three top-rated champions and land in the range the audit predicted from their measured EV leak, which is the confirmation that matters: the abstract "win probability forfeited per game" translates directly into games won. X123's small measured gain (+1.4pp) is itself informative — its endgame errors concentrate in states its own careful style rarely reaches, which is consistent with it having the lowest hand-built EV leak in the field .

Ground truth's verdict

As of July 2, 2026, the endgame is solved and every champion is measured against it. The Closer (it3) has the best endgame in the project (74.0% agreement, 2.7pp/game leak) despite never training on it. The Shape Reader (X188) has the most recoverable headroom (+4.4pp from a perfect endgame alone) — its ladder standing rests on the phases of the game it plays before the close, not the close itself.