Reference

The lookup wing. Every rated strategy on one table, the protocols that produced the numbers, the raw data files, and a register of every citable fact on this site — each with the stable ID the other chapters link to.

Research frozen — results as of July 2, 2026

Every number on this site traces to one fact sheet

This site makes quantitative claims — rankings, win rates, Elo ratings, head-to-head results — and every one of them was transcribed from a single facts sheet, itself read from committed research artifacts on July 2, 2026, the project's freeze date. Each fact carries a stable ID (F-ELO-IT3, F-TB-TIEBLUNDER, …). When a chapter states a load-bearing number, it links that ID to this page; the anchors land either on the row of a table below or on the entry in the fact register at the bottom. If a claim has no fact ID, it is context, not evidence.

Two conventions apply everywhere. First, results are frozen: "champion" means champion as of July 2, 2026, and no page updates that date. Second, every win rate carries its sample size, because a win rate without an n is a rumor.

The catalog: 27 strategies, three lineages, one scale

The table below is the project's roster, ordered by the unified Elo ladder that puts all three lineages on one scale: the classic catalog, the hand-coded discovery champions, and the neural policies (One Ladder tells the story). Ratings are a Bradley–Terry maximum-likelihood fit over 253 C-engine pairings (4,000 games each) plus 58 Python-engine pairings (1,500 games each), anchored at S1 = 1000. As of July 2, 2026, the top of the ladder belongs to the reinforcement-learning policies: The Closer (it3) rates 1222, about 47 Elo clear of the best hand-coded strategy, The Shutout (X181), which it beats 60.0% of the time head-to-head (1,500 games).

A scope note on coverage: the classic catalog is larger than its ladder representation. The project's pre-discovery field held 30 strategies (the 17 Frongello S-bots plus experimental E-variants), but only the representatives that later work trained against or audited were rated. Per the naming convention, five artifacts carry style-of-play names; everything else goes by its research ID.

The unified Elo ladder: all 27 rated strategies with lineage, Elo rating, standard error, rated games, and a one-line description, as of July 2, 2026.
# Strategy Lineage Elo ±SE Games One line
1 The Closer (it3) RL 1222.0 2.8 24,000 PPO league self-play policy; strongest artifact in the project; 74.0% agreement with the exact endgame solve
2 it2_det RL 1215.7 2.7 24,000 Determinism milestone: step-penalty fine-tune that ended point-hoarding — 0 stalls in 26,000 eval games
3 PPO_run2 RL 1205.8 2.7 24,000 Stochastic PPO policy (75.79% pool mean); its argmax deadlocks — randomness was load-bearing
4 The Shutout (X181) Clean slate 1175.5 1.7 94,000 Best hand-coded strategy: race-gated banking, faucet denial, min-miss bull taps; first 18/18 home-pool sweep
5 X168 Clean slate 1173.3 1.7 94,000 X168_LastLaneED — added the lineage's sixth extra-darts redirect site
6 X159 Clean slate 1167.1 1.7 94,000 X159_BrokeDenyED, the anti-champion: beat then-champion X147 head-to-head
7 X147 Clean slate 1162.2 1.7 94,000 X147_BrakeTap45 — hoard brake at −45 plus min-miss taps
8 X129 Clean slate 1135.1 1.7 94,000 Endgame-gated extra-darts redirect; beat X188 at every skill profile on the cross-bench
9 The Shape Reader (X188) Main 1117.5 1.7 88,000 The 89-iteration main-lineage champion and the old site's hero; detector-branch chase denial
10 X109 Main 1113.0 1.7 88,000 The main-lineage keep X188 improved on; owner of the audit's largest catalogued blunder (0.563 win prob at a tie)
11 X123 Clean slate 1107.3 1.6 94,000 Denial + descending-value walk + extra-darts redirect; the first cross-lineage milestone (+2.66pp over X188, same seed)
12 The Textbook (DIST) RL distillation 1085.2 2.6 24,000 The neural champion's doctrine as human-readable rules; beat all seven classics it played, loses ~62% to the X1xx champions
13 X165 Main 1072.7 1.7 88,000 Home-pool keep that overfit its frontier pool — ~40 Elo below X109 on the unified field
14 X112 Main 1070.8 1.7 88,000 Home-pool keep, same overfit pattern as X165
15 H1_Hoarder Clean slate 1052.2 1.7 88,000 Point-hoarder injected as a "weak" pool trainer; outrates every classic baseline
16 S14 Classic 1036.6 1.6 94,000 Chase variant; top classic S-bot on the unified ladder
17 S10 Classic 1031.7 1.6 94,000 Chase variant, second of the S-family pair atop the classics
18 E12 Classic 1018.3 1.6 94,000 FinishOppClosed1: best classic bot, seed of the main lineage and the RL teacher — and the most exploitable audited artifact (82.1%)
19 X103 Main 1018.3 1.7 88,000 The main lineage's founding entry, seeded from E12
20 E10 Classic 1013.7 1.7 88,000 Experimental variant of the classic catalog
21 S2 Classic 1013.6 1.6 94,000 Score/cover at lead 0; strongest classic S-bot at every skill level of the old MPR sweep
22 E3 Classic 1012.3 1.6 94,000 Experimental variant of the classic catalog
23 S1 Classic 1000.0 n/a 88,000 Pure closer; the ladder's fixed anchor (Elo 1000 by definition, so no standard error)
24 E1 Classic 992.4 1.7 94,000 Experimental variant of the classic catalog
25 E11 Classic 989.7 1.7 88,000 Experimental variant of the classic catalog
26 PS Classic 986.7 1.7 88,000 Points-first baseline
27 S6 Classic 949.5 1.7 94,000 Extra-darts classic; bottom of the rated field

The fit is honest about its own error: the mean gap between observed head-to-head results and the Bradley–Terry prediction is 5.08pp, with a worst residual of 16.53pp (S10 vs E1: observed 39.1% where the model predicts 55.6%). The field is also unusually well-behaved: an intransitivity scan over every ordered triple found zero rock-paper-scissors cycles in which every edge clears 52% (≈2.5σ at 4,000 games). To measurement precision, the 27 strategies form a total order.
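The cycle scan can be sketched as a loop over ordered triples. This is a minimal illustration, not the project's scanner: the `win_rate` dictionary layout, the helper names, and the toy numbers are all assumptions, and only the 52% bar comes from the text.

```python
from itertools import permutations

def find_cycles(win_rate, bar=0.52):
    """Return triples (a, b, c) where a beats b, b beats c, and c beats a,
    each at an observed rate strictly above `bar`.

    win_rate[(x, y)] is x's observed win rate against y (hypothetical layout).
    """
    names = sorted({name for pair in win_rate for name in pair})
    cycles = set()
    for a, b, c in permutations(names, 3):
        edges = [(a, b), (b, c), (c, a)]
        if all(win_rate.get(edge, 0.0) > bar for edge in edges):
            # The three rotations of a cycle are the same cycle:
            # store the rotation that starts with the smallest name.
            start = min(range(3), key=lambda i: (a, b, c)[i])
            cycles.add(((a, b, c) * 2)[start:start + 3])
    return sorted(cycles)

def with_reverse(rates):
    """Add the mirrored entry (y, x) = 1 - rate for every (x, y)."""
    full = dict(rates)
    for (x, y), p in rates.items():
        full[(y, x)] = 1.0 - p
    return full

# Toy data only: A > B > C > A, with the weakest edge at 53%.
toy = with_reverse({("A", "B"): 0.55, ("B", "C"): 0.56, ("C", "A"): 0.53})
print(find_cycles(toy))            # the single cycle clears a 52% bar
print(find_cycles(toy, bar=0.54))  # raising the bar above the weakest edge removes it
```

A cycle only counts when its weakest edge clears the bar, which is why the bar is set in units of measurement noise rather than at 50%.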

Selected head-to-heads

A sample of observed pairwise results from the ladder's match data; in each row the first-named strategy beats the second at the stated rate. Raw win/loss counts for all 311 pairings are in ladder.json.

Selected head-to-head win rates from the unified ladder match data: 4,000-game C-engine pairings or 1,500-game Python pairings; cell color encodes the winner's rate.
Matchup (winner first) Win rate Source
The Closer (it3) vs S6 90.4% F-ELO-H2H
The Shutout (X181) vs PS 84.4% F-ELO-H2H
The Closer (it3) vs S2 81.7% F-ELO-H2H
The Shutout (X181) vs X165 74.5% F-ELO-H2H
The Closer (it3) vs X123 65.3% F-ELO-H2H
The Closer (it3) vs The Shutout (X181) 60.0% F-ELO-RLTOP (1,500 games)
X129 vs The Shape Reader (X188) 57.0% F-ELO-H2H
The Shutout (X181) vs The Shape Reader (X188) 54.6% F-ELO-LINEAGEGAP
The Closer (it3) vs it2_det 54.5% F-ELO-RLTOP (1,500 games)
it2_det vs PPO_run2 52.9% F-ELO-RLTOP (1,500 games)
The Shape Reader (X188) vs X109 50.8% F-ELO-H2H
X123 vs The Shape Reader (X188) 50.6% F-ELO-H2H

Shading: head-to-head win rate 51% → 90%.

Off the ladder: rated by other instruments

Five artifacts matter to the story but do not appear on the unified ladder, either because they compete on a different grid (skill-mismatched play), because no Elo fit was ever committed, or because they were superseded before the ladder existed. Their headline numbers come from their own benchmark protocols and are not directly comparable to ladder Elo.

Artifacts not on the unified Elo ladder, with their arm, headline benchmark, and the reason they are unrated.
Artifact Arm Headline number Why it is not on the ladder
The Grinder (X214_UD) Clean-slate underdog program 17.63% mean underdog win rate vs the X181-clone baseline's 15.31% — a 15% relative gain; seed-7 validation 17.65% Specialist for playing up a skill class; rated on the underdog gap grid, not the even-skill ladder
AZ value_iter1 (run 2) AlphaZero cell 65.6% mean over the 11-bot pool with depth-3 expectimax (22,000 games/config, SE ≈ 0.32pp) No committed Elo fit exists; the cell is absent from the ladder's match data
X107_PSChaseMild8 Opus control arm 55.2% mean / 51.7% min on its 11-bot pool, beating all 11 opponents A 10-iteration budget champion; cross-benched at 50.9% mean on the main expanded pool
bc_e12 RL arc (behavior cloning) 55.10% greedy pool mean — a 99.7%-accurate clone that exactly recovers its teacher E12 (55.13%) Stepping stone to PPO, superseded within the arc
A2C v19 Historical RL (Feb 2026) 49.3% stable win rate vs 11 hard bots The pre-PPO ceiling; superseded by the 2026-07 BC→PPO→it3 arc

Methodology: how the numbers were made

Benchmarks and error bars

Two simulators produced every result: a C engine (fast_sim.c) for high-volume tournaments and the original Python engine. Games alternate the first thrower and use the pro/pro skill profile unless a fact says otherwise. The standard cell sizes and their Monte-Carlo errors at a 50% win rate are listed below. The 1σ column is the binomial standard error √(0.25/n), the yardstick behind statements such as "≈2.5σ at 4,000 games"; the 95% column is 1.96 times that.

Benchmark cell sizes with their 1σ standard errors and 95 percent Monte-Carlo intervals.
Cell Games 1σ 95% MC interval
C-engine ladder pairing 4,000 ±0.8pp ±1.5pp
Python ladder pairing 1,500 ±1.3pp ±2.5pp
RL / exploitability eval 2,000 ±1.1pp ±2.2pp
Hybrid endgame match 50,000 ±0.22pp ±0.44pp
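The error bars for these cell sizes follow from the binomial standard error of a win rate. The sketch below assumes the normal approximation at a 50% win rate; the function name is hypothetical. It prints both the one-sigma error and the 95% half-width, so either convention can be checked against the table.

```python
import math

def mc_error_pp(games, p=0.5, z=1.0):
    """Half-width, in percentage points, of a win-rate estimate from `games`
    independent games at true win rate `p` (normal approximation).
    z=1.0 gives one standard error; z=1.96 gives the 95% interval."""
    return 100.0 * z * math.sqrt(p * (1.0 - p) / games)

cells = [
    ("C-engine ladder pairing", 4_000),
    ("Python ladder pairing", 1_500),
    ("RL / exploitability eval", 2_000),
    ("Hybrid endgame match", 50_000),
]
for label, n in cells:
    one_sigma = mc_error_pp(n)
    ci95 = mc_error_pp(n, z=1.96)
    print(f"{label:26s} n={n:>6,}  1 sigma ±{one_sigma:.2f}pp  95% ±{ci95:.2f}pp")
```

Because the error shrinks with the square root of the game count, the 50,000-game hybrid cell is roughly 3.5 times tighter than a 4,000-game ladder pairing.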

Key results were replicated under a second seed before being trusted. The ladder ran on seed 42; The Shutout's home-pool sweep reproduced under a seed-7 re-bench (mean / min of 64.40% / 50.41% vs 64.40% / 50.44%), as did The Grinder's underdog result (17.65% / 13.22% vs 17.63% / 13.31%), and the PPO pool mean replicated on an independent seed at 76.01% vs 75.79%.

The ladder fit

The unified ladder is a Bradley–Terry maximum-likelihood fit with anchor S1 = 1000 and scale 400/ln 10, over 253 C-engine pairings (a full round robin at 4,000 games each) plus 58 Python-engine pairings (1,500 games each, 0 stalls). Standard errors come from inverse Fisher information with the anchor fixed. Cross-engine trust rests on port validation: clean-slate champions ported into the C engine were verified bit-identical to their home engine, and Python ports agreed within 1.5pp of C against a common opponent. Independent corroboration: the RL arc's own keep-rule protocol pegged it3 at Elo 1215.9 from different match data, a 6-Elo (≈0.9pp) gap from the ladder's 1222.0.
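To make the rating scale concrete, here is a small sketch of an anchored Bradley–Terry fit. It is not the project's fitting code: the solver is the textbook minorize-maximize iteration, swapped in for simplicity, the function names and toy match records are assumptions, and the Fisher-information standard errors are omitted. Only the 400/ln 10 scale and the fixed anchor come from the text.

```python
import math

SCALE = 400.0 / math.log(10.0)  # Elo points per unit of log-strength

def expected_win(elo_a, elo_b):
    """Bradley-Terry win probability of A over B on the Elo scale."""
    return 1.0 / (1.0 + math.exp(-(elo_a - elo_b) / SCALE))

def fit_bradley_terry(matches, anchor, anchor_elo=1000.0, iters=500):
    """matches: list of (a, b, wins_a, games). Returns {name: elo}.

    Assumes every strategy has at least one win and one loss, otherwise
    the maximum-likelihood rating is unbounded."""
    names = sorted({n for a, b, _, _ in matches for n in (a, b)})
    wins = {n: 0.0 for n in names}
    for a, b, w, g in matches:
        wins[a] += w
        wins[b] += g - w
    strength = {n: 1.0 for n in names}
    for _ in range(iters):
        denom = {n: 0.0 for n in names}
        for a, b, _, g in matches:
            share = g / (strength[a] + strength[b])
            denom[a] += share
            denom[b] += share
        strength = {n: wins[n] / denom[n] for n in names}
    base = strength[anchor]  # pin the anchor, shift everyone else with it
    return {n: anchor_elo + SCALE * math.log(strength[n] / base) for n in names}

# Toy records only; names and counts are illustrative.
ratings = fit_bradley_terry(
    [("A", "B", 600, 1000), ("B", "C", 550, 1000), ("A", "C", 640, 1000)],
    anchor="C",
)
for name, elo in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {elo:.1f}")

# Ladder ratings of S10 (1031.7) and E1 (992.4) from the catalog table.
print(f"{expected_win(1031.7, 992.4):.3f}")  # 0.556
```

The last line reproduces the model prediction quoted for the worst residual (S10 vs E1), which is a quick way to confirm the scale convention.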

Stall accounting

Cricket has a degenerate corner: a player can refuse to close its last targets and farm points forever. The RL arc's first PPO run found and abused it: 13–34% of greedy-eval games deadlocked, and counting those stalls as losses dropped the run's mean to 46.29%. Every frozen result since uses the fix (a 900-dart episode cap, with a stall scored as a loss) and reports the stall count. The ladder's Python pairings ran 0 stalls; it2_det ran 0 stalls in 26,000 eval games (0 in 52,000 across two seeds), and it3 ran 0 stalls; the exploitability probes ran 0 stalls in all cells. A "0 stalls" note next to a result is that accounting, not a boast.
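The accounting rule can be written down in a few lines. This is a sketch under stated assumptions: `play_capped`, `summarize`, and the scripted toy games are hypothetical helpers, not the project's evaluation harness; only the 900-dart cap and the stall-equals-loss rule come from the text.

```python
from collections import Counter

DART_CAP = 900  # episode cap quoted in the text

def play_capped(throw, cap=DART_CAP):
    """Call throw() once per dart until it returns 'win' or 'loss'.

    `throw` is a hypothetical callable that returns None while the game is
    still running. If the cap is reached first, the episode is a stall."""
    for dart in range(1, cap + 1):
        outcome = throw()
        if outcome in ("win", "loss"):
            return outcome, dart
    return "stall", cap

def summarize(episodes):
    """Win rate with every stall kept in the denominator, i.e. scored as a loss."""
    counts = Counter(outcome for outcome, _ in episodes)
    games = sum(counts.values())
    return {"games": games, "stalls": counts["stall"],
            "win_rate": counts["win"] / games}

def scripted(outcome, at_dart):
    """Toy game that ends with `outcome` on dart number `at_dart`."""
    state = {"dart": 0}
    def throw():
        state["dart"] += 1
        return outcome if state["dart"] == at_dart else None
    return throw

episodes = [
    play_capped(scripted("win", 40)),
    play_capped(scripted("loss", 52)),
    play_capped(lambda: None),  # a hoarder that never closes
    play_capped(scripted("win", 61)),
]
print(summarize(episodes))  # {'games': 4, 'stalls': 1, 'win_rate': 0.5}
```

Dropping the stalled episode instead would report 2 wins in 3 games, which is the inflation the rule exists to prevent.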

The tablebase solve and its gates

The exact endgame solve (Ground Truth) covers the slice of cricket with at most 6 of 14 lanes still open and score difference within ±250, both players on the pro profile: 2,767,444 mark patterns × 3 darts × 501 score diffs = 4.16 billion states, 16.6 GB of exact win probabilities, solved shell-by-shell so each openness shell is a usable checkpoint on its own. Four validation gates all passed: mechanical equivalence against the game engine, analytic spot-checks, a boundary-bracket re-solve of the inner shells, and a Monte-Carlo gate of 20 solved states × 100,000 games each (worst |z| = 2.63, 0 fails). Validation earned its keep: it caught one real bug, biased first-dart RNG seeding under small sequential seeds, fixed by splitmix64 hardening (commit 8b49948).
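The size arithmetic is easy to verify. The counts below come from the text and the fact register (which records the values as f32, 4 bytes each); the flat `state_index` layout is purely an assumption for illustration and says nothing about how the real tablebase is stored.

```python
MARK_PATTERNS = 2_767_444    # mark patterns with at most 6 lanes open
DARTS = 3                    # darts left in the current turn
MAX_DIFF = 250               # score-difference band is -250..+250
SCORE_DIFFS = 2 * MAX_DIFF + 1
BYTES_PER_VALUE = 4          # one f32 win probability per state

states = MARK_PATTERNS * DARTS * SCORE_DIFFS
size_gb = states * BYTES_PER_VALUE / 1e9

def state_index(pattern, dart, diff):
    """Hypothetical flat index: pattern-major, then dart, then score diff."""
    return (pattern * DARTS + dart) * SCORE_DIFFS + (diff + MAX_DIFF)

print(f"{states:,} states")   # 4,159,468,332 states
print(f"{size_gb:.1f} GB")    # 16.6 GB
```

With any dense layout of this kind, the last state maps to index `states - 1`, so the whole table fits one contiguous array.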

The endgame audit

With the solve in hand, 13 strategies were audited: 6,000 games each (2,000 vs each of X188, CS_X181, and S2), with every in-slice decision (52,000 to 75,000 per participant) scored against the exact win probabilities. "Agreement" is tie-aware (a choice within 10⁻⁶ of the best move counts as agreeing); a "blunder" forfeits more than 0.05 win probability in one decision. Together with the unified ladder, this audit is the evidence base for the neural policies; there is no policy tournament grid.
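The scoring rule for a single decision can be sketched as follows. The function names and the dictionary layout are assumptions, and dividing the summed loss by the number of games is one plausible reading of "EV lost per game" rather than a confirmed definition. The thresholds come from the text, and the first example uses the tied-score blunder recorded under F-TB-TIEBLUNDER; the other two decisions are toy values.

```python
TIE_EPS = 1e-6       # tie-aware agreement tolerance
BLUNDER_LOSS = 0.05  # forfeited win probability that makes a blunder

def score_decision(action_values, played):
    """action_values: {action: exact win probability}; played: chosen action."""
    best = max(action_values.values())
    loss = best - action_values[played]
    return {"loss": loss, "agrees": loss < TIE_EPS, "blunder": loss > BLUNDER_LOSS}

def summarize(decisions, games):
    """Aggregate (action_values, played) pairs into audit-style rates."""
    scored = [score_decision(values, played) for values, played in decisions]
    n = len(scored)
    return {
        "agreement": sum(s["agrees"] for s in scored) / n,
        "blunder_rate": sum(s["blunder"] for s in scored) / n,
        "ev_lost_per_game": sum(s["loss"] for s in scored) / games,
    }

# Double bull (EV 0.403) played where single 16 (EV 0.966) was optimal.
tie = score_decision({"DB": 0.403, "S16": 0.966}, played="DB")
print(round(tie["loss"], 3), tie["blunder"])  # 0.563 True

decisions = [
    ({"DB": 0.403, "S16": 0.966}, "DB"),         # blunder
    ({"S20": 0.7000000, "T20": 0.7000004}, "S20"),  # near-tie, counts as agreeing
    ({"S19": 0.55, "S18": 0.52}, "S19"),         # exact best move
]
summary = summarize(decisions, games=1)
print(summary["agreement"], summary["blunder_rate"])
```

The near-tie case shows why agreement and EV lost can diverge: a strategy may differ from the solve often while giving up almost nothing each time.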

Endgame audit ranking of 13 strategies by expected value lost per game against the exact tablebase, with agreement rate and blunder rate; 52,000 to 75,000 scored decisions per participant. Color in the agreement column encodes agreement magnitude.
Rank Participant Lineage Agreement EV lost/game (95% CI) Blunder rate
1 The Closer (it3) RL 74.0% 0.0272 (0.0261–0.0283) 0.38%
2 X123 Clean slate 58.0% 0.0319 (0.0300–0.0339) 1.15%
3 H1_Hoarder Clean slate 39.0% 0.0369 (0.0342–0.0398) 2.07%
4 The Shutout (X181) Clean slate 67.7% 0.0400 (0.0379–0.0421) 1.17%
5 X129 Clean slate 57.2% 0.0428 (0.0407–0.0450) 1.67%
6 X168 Clean slate 62.9% 0.0430 (0.0406–0.0453) 1.09%
7 PS Baseline 51.9% 0.0475 (0.0436–0.0511) 2.48%
8 X147 Clean slate 63.5% 0.0497 (0.0472–0.0521) 1.69%
9 X159 Clean slate 63.9% 0.0507 (0.0484–0.0532) 1.85%
10 The Shape Reader (X188) Main 61.6% 0.0519 (0.0494–0.0547) 3.42%
11 X109 Main 57.3% 0.0583 (0.0556–0.0611) 4.04%
12 S2 Baseline 43.8% 0.0990 (0.0943–0.1040) 4.63%
13 E12 Main 43.2% 0.1039 (0.0990–0.1090) 4.68%

Shading: tablebase agreement 39% → 74%.

Read the ranking by EV lost, not by agreement: H1_Hoarder agrees with the solve only 39% of the time yet leaks less per game than The Shutout at 67.7% agreement, because most of its disagreements are near-ties. A second instrument from the same solve is the hybrid match, where a champion plays itself except that in-slice decisions are replaced by tablebase-greedy play (50,000 games each, ±0.44pp; plain-vs-plain is 50% by symmetry):

Hybrid bot versus plain champion win rates over 50,000 games each: the endgame headroom of four champions.
Champion Hybrid win rate Endgame gain Tablebase darts/game
The Shape Reader (X188) 54.40% +4.4pp 7.5
The Shutout (X181) 53.54% +3.5pp 11.0
The Closer (it3) 52.76% +2.8pp 9.5
X123 51.40% +1.4pp 9.5

Exploitability probes

How much can a dedicated adversary punish each strategy? For each of five frozen targets, a PPO best-response was trained against that target alone (warm-started from the champion) and evaluated at 2,000 games per mode, with stalls counted as losses and 0 stalls observed. Lower is harder to exploit: the attack barely beats a coin flip against The Closer but wins 82% against E12.

Trained best-response win rates against five frozen targets, greedy and sampled evaluation, 2,000 games per mode; cell color encodes the adversary's win rate.
Target Best-response WR (greedy) Best-response WR (sampled)
The Closer (it3) 53.2% 53.0%
The Shutout (X181) 59.2% 57.3%
X129 61.1% 60.8%
The Shape Reader (X188) 72.1% 69.7%
E12 82.1% 81.2%

Shading: best-response win rate 53% → 82%.

Data downloads

Two data files ship with this site, both copied verbatim from committed research artifacts on the freeze date. Everything a chapter claims about the ladder or the endgame audit can be recomputed from them.

Data files shipped with the site, their contents, and provenance.
File Contents Provenance
ladder.json Ladder metadata (protocol, port validation), the 27-entry rating table with standard errors, and all 311 raw match records (strategies, wins, games, engine, seed) Verbatim copy of autoresearch_strategies/elo_matches.json, 2026-07-02, unmodified
endgame_audit.json The 13-participant audit ranking (agreement, EV loss with 95% CI, blunder rate, breakdowns by darts/diff/openness) plus the 50,000-game hybrid match results Field values assembled unmodified from tablebase/audit_out/*.json; lineage labels per tablebase/AUDIT.md

The previous site's data trees are retained at their old paths for link stability; the chapters on this site transcribe only from the frozen facts sheet, not from legacy JSON.

The fact register

Every citable fact on this site, by stable ID. Chapter links land here (or on the catalog and audit rows above, which carry their own IDs). Statements are condensed; the frozen facts sheet in the repository holds the full wording and the source path for each. All results as of July 2, 2026.

Named artifacts

Fact register: the five named artifacts.
Fact ID Statement
F-NAME-SHUTOUT The Shutout = X181 (clean-slate X181_RaceBull): best hand-coded strategy, with race-gated banking, faucet denial, and min-miss bull taps.
F-NAME-CLOSER The Closer = it3 (it3_league_ckpt0600, PPO league iteration 3, greedy): strongest artifact in the project.
F-NAME-TEXTBOOK The Textbook = DIST: interpretable rule-bot distilled from the PPO policy.
F-NAME-GRINDER The Grinder = X214_UD (X214_UD_Brake4): underdog (outgunned-player) champion.
F-NAME-SHAPEREADER The Shape Reader = X188 (X188_X186DetectorChaseDenial): 89-iteration main-lineage champion, the old site's hero.

The unified ladder

Fact register: unified Elo ladder facts. Per-strategy ratings anchor to the catalog rows above.
Fact ID Statement
F-ELO-METHOD 27 artifacts rated by Bradley–Terry MLE, anchor S1 = 1000, scale 400/ln 10; 253 C-engine pairings (4,000 games each, pro/pro, seed 42) + 58 Python pairings (1,500 games each, 0 stalls); C ports validated bit-identical, Python ports within 1.5pp; SEs from inverse Fisher information.
F-ELO-FIT Model fit: mean |observed − predicted| edge error 5.08pp; max residual 16.53pp (S10 vs E1: observed 39.1%, predicted 55.6%).
F-ELO-RLTOP The RL policies top the ladder; it3 clears X181 by ~47 Elo and beats it 60.0% head-to-head (1,500 games); it3 > it2_det 54.5%, it2_det > PPO_run2 52.9%.
F-ELO-LINEAGEGAP Every clean-slate champion from X129 up (1135–1176) rates above main-lineage champion X188 (1118); X181 beats X188 54.6% (top-of-lineage gap ~58 Elo).
F-ELO-OVERFIT Within the main lineage only X188 improved on X109 on the unified field; X112 (1071) and X165 (1073), both home-pool "keeps", rate ~40 Elo below X109 (1113): overfit to a fixed frontier pool.
F-ELO-DIST-STORY DIST (1085) recovers about a third of the Elo gap from the classic baselines to its teacher PPO_run2 (1206); beats all seven classic representatives it played (55.7–70.9%), loses ~62% to the X1xx champions.
F-ELO-H1-STORY H1_Hoarder (1052), injected as a "weak" pool trainer, outrates every Frongello baseline.
F-ELO-CYCLES Zero intransitive cycles clear a >52% bar (≈2.5σ at 4,000 games): the 27-strategy field is, to measurement precision, a total order. 30 near-cycles exist inside the >50% noise band; strongest: E11 → S1 → S14 → E11 (weakest edge 51.9%).

Endgame audit

Fact register: endgame audit facts. The full ranking and hybrid tables anchor above.
Fact ID Statement
F-TB-METHOD 13 participants × 6,000 games each (2,000 vs each of X188, CS_X181, S2; pro/pro, alternating starters); every in-slice decision (openness ≤ 6) scored against exact win probabilities; agreement tie-aware (< 10⁻⁶); blunder = more than 0.05 forfeited; 52k–75k decisions per participant.
F-TB-IT3 The Closer (it3) has the best endgame of anything built: 74.0% agreement, 2.7pp/game leak (71,961 decisions), 0.38% blunder rate. That is half X188's leak and one-ninth its blunder rate, with no endgame-specific training.
F-TB-X188WEAK X188 ranks 10/13 in the endgame slice, forfeiting 5.2 win-probability points per game (52,415 decisions, 3.42% blunder rate); every clean-slate champion beats it there. Its rating is carried by opening/midgame play.
F-TB-AHEAD Agreement collapses when ahead: 52–62% for everyone (vs up to 90.2% when behind, which is it3's behind-agreement). Behind, chase heuristics coincide with optimal play; ahead, optimal play keeps scoring and denying while champions race to close.
F-TB-TIEBLUNDER Largest catalogued single-decision forfeit: X109 gives up 0.563 win prob at a tied score with one dart (played double bull, EV 0.403; optimal S16, EV 0.966; the single wins because ties go to the closer). X188 has the identical blind spot (forfeits 0.542: played DB, optimal S15).
F-TB-MOTIF Of 260 catalogued exemplar blunders (loss > 0.05, top-20 per participant): closed-when-should-score 188, scored-when-should-close 55, closed wrong lane 17. Dominant leak: throwing at one's own last open lane (usually the bull) when behind or level.
F-TB-AGREEMENT-VS-EV Agreement % and EV lost measure different things: H1_Hoarder agrees only 39% yet loses less per game than X181 at 67.7% agreement, because most of its disagreements are near-ties. Rank by EV, not match rate.

The tablebase solve

Fact register: exact tablebase solve facts.
Fact ID Statement
F-TBSOLVE-SIZE Solved slice (openness ≤ 6, diff band ±250, pro/pro): 2,767,444 mark patterns × 3 darts × 501 diffs = 4.16 billion states, 16.6 GB of f32 values, solved shell-by-shell.
F-TBSOLVE-GATES All validation gates passed: mechanical equivalence, analytic spot-checks, boundary-bracket re-solve of shells ≤ 4, and a Monte-Carlo gate (20 states × 100,000 games; worst |z| = 2.63, 0 fails).
F-TBSOLVE-SEEDBUG One real bug found in validation: raw xoshiro seeding biased the first dart under small sequential seeds; fixed by splitmix64 hardening (commit 8b49948).

Exploitability

Fact register: exploitability probe facts. Per-target rates anchor to the exploitability table above.
Fact ID Statement
F-EXP-READING Lower best-response win rate = harder to exploit. A PPO adversary trained solely against it3 reaches 53.2%, barely better than a coin flip; the same attack beats X188 72% of the time and E12 82%. (2,000 games/mode, stalls-as-losses, 0 stalls in all cells.)

The RL arc

Fact register: reinforcement learning arc facts.
Fact ID Statement
F-RL-BASELINE Best heuristic on the 11-bot hard pool: E12 at 55.13% mean (2,000 games/matchup, pro/pro).
F-RL-BC BC clone of E12 (99.7% action accuracy): 55.10% greedy mean; imitation exactly recovers its teacher.
F-RL-RUN1 PPO run 1 failed by reward hacking: refusing to close and hoarding points indefinitely; 13–34% of greedy games deadlocked; stalls-as-losses mean 46.29%. Fix: 900-dart cap, stall = loss.
F-RL-PPO PPO run 2 (stochastic): 75.79% mean over the 11-bot pool (2,000 games/matchup, 0 stalls in 22,000 games, every matchup > 50%); seed replication 76.01%. Caveats: its argmax deadlocks, and strength is partly pool overfitting.
F-RL-HELDOUT Vs held-out X123 (never in training): PPO 64.8%, BC clone 29.5% (2,000 games); fine-tuning is worth +35pp on a structurally novel denial opponent.
F-RL-IT2 it2_det (step-penalty fine-tune): greedy mean 72.89%, min 57.65%, 0 stalls in 26,000 games (0 in 52,000 across two seeds); hoarding gone: median game vs S2 is 52 darts, down from 88.
F-RL-IT3 it3 (league self-play): 15-member league mean 70.41% greedy, min 52.45% (vs its parent it2), 0 stalls; floor gain +6.45pp on the 14-member league. Anchored Elo 1215.9 on the keep-rule protocol, consistent with the ladder's 1222.0 (Δ6 Elo ≈ 0.9pp).
F-RL-SKILLGRID The champion transfers across skill without retraining: league mean 70.41% pro / 70.28% good / 66.56% amateur, 0 stalls, beats every member at every profile (500 games/matchup).
F-RL-PARETO A 3×256 capacity jump moved the hard band +2–4pp but paid it back on the easy band; league mean flat over 5 evals. The ceiling is a hard/easy Pareto frontier (~70.4 league / ~1216 Elo for this policy class); extra exploration was net-negative (−1pp).
F-RL-INFERENCE Opponent inference closed by measurement: features carry identity (37–38% top-1 over 15 classes vs 6.7% chance), the policy conditions on it (12.3% of greedy actions flip), yet league mean stays flat (70.41 / 70.9 / 70.4).
F-RL-BRENVELOPE The per-opponent best-response envelope above the champion sums to ≈+1.8pp league mean (≈+13–14 Elo) under oracle assumptions, below the +2pp/+15-Elo keep bar. Measured: BR-vs-S14 +3.95pp, BR-vs-S16 +2.5pp, BR-vs-X129 +1.6pp.
F-RL-A2C Historical A2C ceiling (Feb 2026): 49.3% stable WR vs 11 hard bots (v19, entropy floor 0.01). Superseded by the 2026-07 arc: 49.3% → 75.79% pool mean → 70.41% deterministic league champion.

The clean-slate arc

Fact register: clean-slate discovery arc facts.
Fact ID Statement
F-CS-WINTAP The win-tap (6 lanes closed, lead ≥ 0: aim the last unclosed target, min-miss single when 1 away; a tie is a winning position) was found at iteration 3 (X102_WinTap). The main lineage never found the full mechanic in 115 entries; its absence is X188's biggest blunder class.
F-CS-FAUCET Faucet denial (X110_FaucetShut: while covering at a crushing lead, shut the opponent's income lanes before claiming empty ones) jumped the champion 55.0% → 64.4% mean: +9.4pp in one step, the largest single mechanic found in the project. Signature: PS +17.9, S6 +15.7, E1 +13.9pp.
F-CS-TRAJ Champion trajectory (home-pool mean/min at keep; pool grew 11 → 18): X109 55.0/49.6 → X123 66.0/56.6 → X129 → X147 65.7/53.9 → X159 65.1 → X168 64.9/49.3 → X181 64.40/50.44, the first 18/18 sweep (seed-7 re-bench 64.40/50.41). Home means are not comparable across pool sizes.
F-CS-XBENCH-X123 Cross-lineage milestone 1: X123 on the main repo's expanded 14-bot pool scores 63.3% mean vs X188's 60.65% baseline (+2.66pp same-seed, 14/14 beaten); head-to-head vs X188: 55.8 / 52.4 / 50.0% (amateur/good/pro).
F-CS-XBENCH-X129 Cross-lineage milestone 2: X129 scores 64.8% (+4.13pp vs X188, same seed), min 56.7%; head-to-head vs X188 57.5 / 56.2 / 55.6%; it beats X188 at every skill profile.
F-CS-XBENCH-LATER No committed fixed-reference cross-bench exists for X147/X168/X181 on the main pool; their superiority over X188 is evidenced by the unified ladder (1162/1173/1176 vs 1118) and the direct head-to-head (X181 beats X188 54.6%, 4,000 games).
F-CS-OVERFIT-ECHO The clean lineage independently re-derived main-lineage findings (the mr≤9 dead-code result; the value-ordered lane-selection law): basin convergence, not path artifact.
F-CS-UD Underdog program champion: The Grinder (X214_UD) at 17.63% mean / 13.31% min underdog WR vs the X181-clone baseline's 15.31%, a 15% relative gain (seed-7: 17.65/13.22). Recalibration: brake at any deficit, farm cap 80 → 10; the farm/deny boundaries are skill-relative.
F-CS-BULLTAP Min-miss single-aim bull taps (X179_BullTapSingle) worth +0.27pp with no losing matchup; composed into X181. Corrects the main lineage's double-at-bull tap doctrine.

The Opus control arm

Fact register: Opus control arm facts.
Fact ID Statement
F-OPUS-RESULT Second clean-room arm (10-iteration budget, original 11-bot pool) ended at X107_PSChaseMild8: 55.2% mean / 51.7% min, beats all 11 opponents (E12 baseline: 54.6% mean).
F-OPUS-VS-FABLE At the same iteration count the other arm was at ~55.0% on the same pool, which is comparable; that arm then ran 100+ more iterations and found faucet denial. X107 cross-benched on the main expanded pool: 50.9% mean, 30–39% head-to-head vs X188.
F-OPUS-PLATEAU The arm's own reflection: threshold tuning saturated (single-peaked mean-vs-multiplier curve, best 55.2% at mult 8); further gains would need a genuinely new mechanic the session did not reach.

The AlphaZero cell

Fact register: AlphaZero cell facts.
Fact ID Statement
F-AZ-RESULT Final artifact run2/value_iter1.pt with depth-3 batched-leaf expectimax: 65.6% mean over the 11-bot pool (22,000 games/config, SE ≈ 0.32pp); net-only 63.4%, depth-2 64.3%.
F-AZ-CEILING It did not beat the search-only ceiling: shallow expectimax with a hand-rolled race-heuristic leaf scores 67.8% (depth 2, 500 games/bot). The value net re-derived, but did not surpass, an explicit darts-to-victory computation.
F-AZ-REPLAYBUG Most consequential training decision: replay-buffer recency, worth +9pp. Run 1's buffer never evicted, so every epoch re-fit near-random iter-0 self-play; the ".55 plateau" was stale-data anchoring. An empty-buffer restart jumped the pool mean +7–12pp in one iteration.
F-AZ-ELO No anchored Elo for the AZ artifact exists in any committed file; the cell is absent from the ladder's match data. Chapters state its 65.6% pool mean instead.

The main lineage and the classics

Fact register: main lineage and classic catalog facts.
Fact ID Statement
F-MAIN-X188 X188 kept 2026-04-21 at entry 089 (detector-branch chase denial). Baseline on the expanded 14-bot pool: 60.65% mean / 51.8% min, sweeping 14/14. Champion through entry 115; the 2026-07-01 session produced zero keeps: the sweep saturated.
F-MAIN-BULLTAP X188's tap doctrine aims HIT_DOUBLE at the bull, inherited from entry 003. Two independent lines proved it wrong: the clean-slate min-miss single bull tap (+0.27pp) and the tablebase blunder catalog (0.542 forfeited playing DB at a tie).
F-MAIN-DEADCODE Entry 107 proved X188's mr≤9 endgame-gate clause is dead code; entry 115 proved its tie-chase is unreachable (> vs ≥ bit-identical).
F-MAIN-LINEAGEVALUE Cross-bench of the Opus arm's 10-iteration champion on the expanded pool (50.9% mean, 30–39% vs X188) quantifies 89 iterations of lineage at ~10pp of mean.
F-CLASSIC-E12 E12 (FinishOppClosed1) was the best classic bot: #1 of the 30 pre-X strategies, seed of the main lineage (X103) and the RL arc's BC teacher. Unified ladder: 1018 (mid-table); most exploitable audited artifact (82.1%).
F-CLASSIC-SBOTS Of the 17 Frongello S-bots: S2 (score/cover at lead 0) is the strongest at every MPR level of the sweep; S14/S10 (chase variants) top the S-family on the ladder (1037/1032 vs S2's 1014); S1 (pure closer, anchor) and S6 (949.5, extra darts) bracket the bottom.
F-MPR-SWEEP X188 MPR sweep (31 strategies, 20,000 games/matchup, equal skill, 11 MPR levels 0.8–5.6): X188 ranks #1 at all 11 levels, average WR 59.0% → 68.0%; E12 #2 at all 11 (57.8–59.5%); best S-bot S2 (57.4–59.2%). Scope caveat: within its 31-strategy field only; the sweep predates the clean-slate and RL artifacts, which beat X188. The old site's "never loses a head-to-head" claim is falsified on the unified field.

Timeline

Fact register: project timeline facts.
Fact ID Statement
F-TIME-SESSIONS 36 research sessions, 2026-02-05 → 2026-07-02. By month: Feb 23, Mar 4, Apr 4, Jul 5. The project slept ~10 weeks between the X188 keep and the clean-slate/RL/tablebase burst.
F-TIME-BIRTH Project birth 2026-02-05 (the Q-learning/A2C era); first public commit 2026-03-19.
F-TIME-AUTORESEARCH The autoresearch loop was born 2026-04-20; X188 was kept 2026-04-21, 89 entries in two days.
F-TIME-CLEANSLATE 2026-07-01: the clean-slate arm (114 entries), the Opus control arm, and the RL arc's BC→PPO run all started the same day. 2026-07-02: unified ladder, tablebase solve + audit, exploitability audit, underdog program, and the freeze.
F-TIME-FREEZE Site facts frozen as of July 2, 2026. Every champion claim on the site carries this stamp.