Every number on this site traces to one fact sheet
This site makes quantitative claims — rankings, win rates, Elo ratings,
head-to-head results — and every one of them was transcribed from a single
facts sheet, itself read from committed research artifacts on July 2, 2026, the
project's freeze date. Each fact carries a stable ID (F-ELO-IT3,
F-TB-TIEBLUNDER, …). When a chapter states a load-bearing number,
it links that ID to this page; the anchors land either on the row of a table below
or on the entry in the fact register at the bottom.
If a claim has no fact ID, it is context, not evidence.
Two conventions apply everywhere. First, results are frozen: "champion" means champion as of July 2, 2026, and no page updates that date. Second, every win rate carries its sample size, because a win rate without an n is a rumor.
The catalog: 27 strategies, three lineages, one scale
The table below is the project's roster, ordered by the unified Elo ladder that put all three lineages (the classic catalog, the hand-coded discovery champions, and the neural policies) on one scale (One Ladder tells the story). Ratings are Bradley–Terry maximum-likelihood fits over 253 C-engine pairings (4,000 games each) plus 58 Python-engine pairings (1,500 games each), anchored at S1 = 1000. As of July 2, 2026, the top of the ladder belongs to the reinforcement-learning policies: The Closer (it3) rates 1222, about 47 Elo clear of the best hand-coded strategy, The Shutout (X181), which it beats 60.0% head-to-head (1,500 games).
A scope note on coverage: the classic catalog is larger than its ladder representation. The project's pre-discovery field held 30 strategies (the 17 Frongello S-bots plus experimental E-variants), but only the representatives that later work trained against or audited were rated. Per the naming convention, five artifacts carry style-of-play names; everything else goes by its research ID.
| # | Strategy | Lineage | Elo | ±SE | Games | One line |
|---|---|---|---|---|---|---|
| 1 | The Closer (it3) | RL | 1222.0 | 2.8 | 24,000 | PPO league self-play policy; strongest artifact in the project; 74.0% agreement with the exact endgame solve |
| 2 | it2_det | RL | 1215.7 | 2.7 | 24,000 | Determinism milestone: step-penalty fine-tune that ended point-hoarding — 0 stalls in 26,000 eval games |
| 3 | PPO_run2 | RL | 1205.8 | 2.7 | 24,000 | Stochastic PPO policy (75.79% pool mean); its argmax deadlocks — randomness was load-bearing |
| 4 | The Shutout (X181) | Clean slate | 1175.5 | 1.7 | 94,000 | Best hand-coded strategy: race-gated banking, faucet denial, min-miss bull taps; first 18/18 home-pool sweep |
| 5 | X168 | Clean slate | 1173.3 | 1.7 | 94,000 | X168_LastLaneED — added the lineage's sixth extra-darts redirect site |
| 6 | X159 | Clean slate | 1167.1 | 1.7 | 94,000 | X159_BrokeDenyED, the anti-champion: beat then-champion X147 head-to-head |
| 7 | X147 | Clean slate | 1162.2 | 1.7 | 94,000 | X147_BrakeTap45 — hoard brake at −45 plus min-miss taps |
| 8 | X129 | Clean slate | 1135.1 | 1.7 | 94,000 | Endgame-gated extra-darts redirect; beat X188 at every skill profile on the cross-bench |
| 9 | The Shape Reader (X188) | Main | 1117.5 | 1.7 | 88,000 | The 89-iteration main-lineage champion and the old site's hero; detector-branch chase denial |
| 10 | X109 | Main | 1113.0 | 1.7 | 88,000 | The main-lineage keep X188 improved on; owner of the audit's largest catalogued blunder (0.563 win prob at a tie) |
| 11 | X123 | Clean slate | 1107.3 | 1.6 | 94,000 | Denial + descending-value walk + extra-darts redirect; the first cross-lineage milestone (+2.66pp over X188, same seed) |
| 12 | The Textbook (DIST) | RL distillation | 1085.2 | 2.6 | 24,000 | The neural champion's doctrine as human-readable rules; beat all seven classics it played, loses ~62% to the X1xx champions |
| 13 | X165 | Main | 1072.7 | 1.7 | 88,000 | Home-pool keep that overfit its frontier pool — ~40 Elo below X109 on the unified field |
| 14 | X112 | Main | 1070.8 | 1.7 | 88,000 | Home-pool keep, same overfit pattern as X165 |
| 15 | H1_Hoarder | Clean slate | 1052.2 | 1.7 | 88,000 | Point-hoarder injected as a "weak" pool trainer; outrates every classic baseline |
| 16 | S14 | Classic | 1036.6 | 1.6 | 94,000 | Chase variant; top classic S-bot on the unified ladder |
| 17 | S10 | Classic | 1031.7 | 1.6 | 94,000 | Chase variant, second of the S-family pair atop the classics |
| 18 | E12 | Classic | 1018.3 | 1.6 | 94,000 | FinishOppClosed1: best classic bot, seed of the main lineage and the RL teacher — and the most exploitable audited artifact (82.1%) |
| 19 | X103 | Main | 1018.3 | 1.7 | 88,000 | The main lineage's founding entry, seeded from E12 |
| 20 | E10 | Classic | 1013.7 | 1.7 | 88,000 | Experimental variant of the classic catalog |
| 21 | S2 | Classic | 1013.6 | 1.6 | 94,000 | Score/cover at lead 0; strongest classic S-bot at every skill level of the old MPR sweep |
| 22 | E3 | Classic | 1012.3 | 1.6 | 94,000 | Experimental variant of the classic catalog |
| 23 | S1 | Classic | 1000.0 | — | 88,000 | Pure closer; the ladder's fixed anchor (Elo 1000 by definition) |
| 24 | E1 | Classic | 992.4 | 1.7 | 94,000 | Experimental variant of the classic catalog |
| 25 | E11 | Classic | 989.7 | 1.7 | 88,000 | Experimental variant of the classic catalog |
| 26 | PS | Classic | 986.7 | 1.7 | 88,000 | Points-first baseline |
| 27 | S6 | Classic | 949.5 | 1.7 | 94,000 | Extra-darts classic; bottom of the rated field |
The fit is honest about its own error: the mean gap between observed head-to-head results and the Bradley–Terry prediction is 5.08pp, with a worst residual of 16.53pp (S10 vs E1: observed 39.1% where the model predicts 55.6%). And the field is unusually well-behaved: an intransitivity scan over every ordered triple found zero rock-paper-scissors cycles in which every edge clears 52% (≈2.5σ at 4,000 games). To measurement precision, the 27 strategies form a total order.
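The worst residual quoted above can be checked from the table's ratings alone. This is a minimal sketch, assuming the conventional logistic Elo curve on a 400-point scale; the function name `expected_score` is illustrative and not taken from the project's code.

```python
def expected_score(elo_a: float, elo_b: float) -> float:
    """Predicted probability that A beats B under the logistic Elo curve."""
    return 1.0 / (1.0 + 10.0 ** (-(elo_a - elo_b) / 400.0))

# Ratings copied from the ladder table: S10 = 1031.7, E1 = 992.4.
predicted = expected_score(1031.7, 992.4)   # about 0.556
observed = 0.391                            # S10 vs E1 as reported above
residual_pp = abs(observed - predicted) * 100.0

print(f"predicted {predicted:.1%}, residual {residual_pp:.1f}pp")
```

The same function turns any rating gap in the table into a predicted head-to-head rate, which is how an observed result can be compared with the model's expectation.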
Selected head-to-heads
A sample of observed pairwise results from the ladder's match data; in each row the first-named strategy beats the second at the stated rate. Raw win/loss counts for all 311 pairings are in ladder.json.
| Matchup (winner first) | Win rate | Source |
|---|---|---|
| The Closer (it3) vs S6 | 90.4% | F-ELO-H2H |
| The Shutout (X181) vs PS | 84.4% | F-ELO-H2H |
| The Closer (it3) vs S2 | 81.7% | F-ELO-H2H |
| The Shutout (X181) vs X165 | 74.5% | F-ELO-H2H |
| The Closer (it3) vs X123 | 65.3% | F-ELO-H2H |
| The Closer (it3) vs The Shutout (X181) | 60.0% | F-ELO-RLTOP (1,500 games) |
| X129 vs The Shape Reader (X188) | 57.0% | F-ELO-H2H |
| The Shutout (X181) vs The Shape Reader (X188) | 54.6% | F-ELO-LINEAGEGAP |
| The Closer (it3) vs it2_det | 54.5% | F-ELO-RLTOP (1,500 games) |
| it2_det vs PPO_run2 | 52.9% | F-ELO-RLTOP (1,500 games) |
| The Shape Reader (X188) vs X109 | 50.8% | F-ELO-H2H |
| X123 vs The Shape Reader (X188) | 50.6% | F-ELO-H2H |
Shading: head-to-head win rate 51% → 90%.
Off the ladder: rated by other instruments
Five artifacts matter to the story but do not appear on the unified ladder, either because they compete on a different grid (skill-mismatched play), because no Elo fit was ever committed, or because they were superseded before the ladder existed. Their headline numbers come from their own benchmark protocols and are not directly comparable to ladder Elo.
| Artifact | Arm | Headline number | Why it is not on the ladder |
|---|---|---|---|
| The Grinder (X214_UD) | Clean-slate underdog program | 17.63% mean underdog win rate vs the X181-clone baseline's 15.31% — a 15% relative gain; seed-7 validation 17.65% | Specialist for playing up a skill class; rated on the underdog gap grid, not the even-skill ladder |
| AZ value_iter1 (run 2) | AlphaZero cell | 65.6% mean over the 11-bot pool with depth-3 expectimax (22,000 games/config, SE ≈ 0.32pp) | No committed Elo fit exists; the cell is absent from the ladder's match data |
| X107_PSChaseMild8 | Opus control arm | 55.2% mean / 51.7% min on its 11-bot pool, beating all 11 | 10-iteration budget champion; cross-benched at 50.9% mean on the main expanded pool |
| bc_e12 | RL arc (behavior cloning) | 55.10% greedy pool mean — a 99.7%-accurate clone that exactly recovers its teacher E12 (55.13%) | Stepping stone to PPO, superseded within the arc |
| A2C v19 | Historical RL (Feb 2026) | 49.3% stable win rate vs 11 hard bots | The pre-PPO ceiling; superseded by the 2026-07 BC→PPO→it3 arc |
Methodology: how the numbers were made
Benchmarks and error bars
Two simulators produced every result: a C engine (fast_sim.c) for
high-volume tournaments and the original Python engine. Games alternate the first
thrower and use the pro/pro skill profile unless a fact says otherwise. The standard
cell sizes and their 95% Monte-Carlo intervals:
| Cell | Games | 95% MC interval |
|---|---|---|
| C-engine ladder pairing | 4,000 | ±0.8pp |
| Python ladder pairing | 1,500 | ±1.3pp |
| RL / exploitability eval | 2,000 | ±1.1pp |
| Hybrid endgame match | 50,000 | ±0.44pp |
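
For readers who want to reproduce error bars of this kind, here is a minimal sketch. It assumes a binomial model with a normal approximation and a win rate near 50%; the helper name `mc_half_width_pp` is hypothetical.

```python
import math

def mc_half_width_pp(games: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of a normal-approximation interval, in percentage points.

    z = 1.96 gives a 95% interval; z = 1.0 gives one standard error.
    """
    return z * math.sqrt(p * (1.0 - p) / games) * 100.0

cells = [
    ("C-engine ladder pairing", 4_000),
    ("Python ladder pairing", 1_500),
    ("RL / exploitability eval", 2_000),
    ("Hybrid endgame match", 50_000),
]
for label, games in cells:
    one_se = mc_half_width_pp(games, z=1.0)
    ci95 = mc_half_width_pp(games)
    print(f"{label:26s} n={games:6d}  1 SE = {one_se:.2f}pp  95% = {ci95:.2f}pp")
```

Under these assumptions the hybrid row's ±0.44pp matches the 1.96σ half-width, while ±0.8pp, ±1.3pp, and ±1.1pp match one standard error at their sample sizes.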
Key results were replicated under a second seed before being trusted: the ladder ran on seed 42; The Shutout's home-pool sweep reproduced under a seed-7 re-bench (64.40% / 50.41% vs 64.40% / 50.44%), as did The Grinder's underdog result (17.65 / 13.22 vs 17.63 / 13.31), and the PPO pool mean replicated on an independent seed at 76.01% vs 75.79%.
The ladder fit
The unified ladder is a Bradley–Terry maximum-likelihood fit with anchor S1 = 1000 and scale 400/ln 10, over 253 C-engine pairings (a full round robin at 4,000 games each) plus 58 Python-engine pairings (1,500 games each, 0 stalls). Standard errors come from inverse Fisher information with the anchor fixed. Cross-engine trust rests on port validation: clean-slate champions ported into the C engine were verified bit-identical to their home engine, and Python ports agreed within 1.5pp of C against a common opponent. Independent corroboration: the RL arc's own keep-rule protocol pegged it3 at Elo 1215.9 from different match data, a 6-Elo (≈0.9pp) gap from the ladder's 1222.0.
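To make the fit concrete, here is a minimal sketch of an anchored Bradley–Terry fit. It uses the classic fixed-point (minorization-maximization) iteration, which is an assumption: the source does not say which optimizer the project used. Standard errors are omitted. The function name, the record layout `(a, b, wins_a, games)`, and the toy players are all hypothetical.

```python
import math

def fit_bradley_terry(matches, anchor, anchor_elo=1000.0, iters=500):
    """Fit Bradley-Terry strengths and report them on an Elo-like scale.

    matches: list of (a, b, wins_a, games) tuples.
    Assumes every player has at least one win and one loss, so that all
    fitted strengths stay strictly positive and finite.
    """
    players = sorted({p for a, b, _, _ in matches for p in (a, b)})
    wins = {p: 0.0 for p in players}
    for a, b, w, n in matches:
        wins[a] += w
        wins[b] += n - w

    strength = {p: 1.0 for p in players}
    for _ in range(iters):
        denom = {p: 0.0 for p in players}
        for a, b, _, n in matches:
            share = n / (strength[a] + strength[b])
            denom[a] += share
            denom[b] += share
        strength = {p: wins[p] / denom[p] for p in players}

    scale = 400.0 / math.log(10.0)        # the 400/ln 10 scale from the text
    base = math.log(strength[anchor])     # pin the anchor at anchor_elo
    return {p: anchor_elo + scale * (math.log(strength[p]) - base)
            for p in players}

# Toy data only, not project results.
toy = [("S1", "A", 1800, 4000), ("S1", "B", 1500, 4000), ("A", "B", 1700, 4000)]
for name, elo in sorted(fit_bradley_terry(toy, anchor="S1").items()):
    print(f"{name}: {elo:.1f}")
```

Because only rating differences are identified by the likelihood, fixing one player (here S1) is what makes the numbers comparable across runs.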
Stall accounting
Cricket has a degenerate corner: a player can refuse to close its last targets and farm points forever. The RL arc's first PPO run found and abused it: 13–34% of greedy-eval games deadlocked, and counting those stalls as losses dropped the run's mean to 46.29%. Every frozen result since uses the fix and reports the count: a 900-dart episode cap with a stall scored as a loss. The ladder's Python pairings ran 0 stalls; it2_det ran 0 stalls in 26,000 eval games (0 in 52,000 across two seeds), and it3 ran 0 stalls; the exploitability probes ran 0 stalls in all cells. A "0 stalls" note next to a result is that accounting, not a boast.
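The accounting rule can be sketched in a few lines. This is an illustration under stated assumptions, not the project's evaluation code: the function names are hypothetical, and the caller is assumed to stop any game that reaches the 900-dart cap and report it with no winner.

```python
from typing import Optional

DART_CAP = 900  # episode cap from the text; enforced by the (assumed) game loop

def outcome_for(player: str, winner: Optional[str], darts_thrown: int) -> str:
    """Classify one finished episode from `player`'s point of view."""
    if winner is None:
        # No winner means the game loop hit the cap and gave up.
        assert darts_thrown >= DART_CAP, "unfinished game below the cap"
        return "stall"
    return "win" if winner == player else "loss"

def win_rate_stalls_as_losses(outcomes):
    """Return (win rate with stalls counted as losses, stall count)."""
    wins = sum(o == "win" for o in outcomes)
    stalls = sum(o == "stall" for o in outcomes)
    return wins / len(outcomes), stalls

# Toy episodes only: one win, one loss, one capped game.
outs = [
    outcome_for("p", "p", 60),
    outcome_for("p", "q", 70),
    outcome_for("p", None, 900),
]
rate, stalls = win_rate_stalls_as_losses(outs)
print(f"win rate {rate:.1%} with {stalls} stall(s) counted as losses")
```

Reporting the stall count alongside the rate is what lets a reader tell a clean result from one where capped games were quietly absorbed.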
The tablebase solve and its gates
The exact endgame solve (Ground Truth) covers
the slice of cricket with at most 6 of 14 lanes still open and score difference
within ±250, both players on the pro profile:
2,767,444 mark patterns × 3 darts × 501 score diffs =
4.16 billion states, 16.6 GB of exact win probabilities,
solved shell-by-shell so each openness shell is a usable checkpoint on its own.
Four validation gates all passed: mechanical equivalence
against the game engine, analytic spot-checks, a boundary-bracket re-solve of the
inner shells, and a Monte-Carlo gate of 20 solved states × 100,000 games each
(worst |z| = 2.63, 0 fails). Validation earned its keep:
it caught one real bug, biased first-dart RNG seeding under small sequential
seeds, fixed by splitmix64 hardening (commit 8b49948).
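The size arithmetic and the Monte-Carlo gate statistic can be sketched as follows. The 4-byte value width follows the fact register's "f32 values"; the z-score formula is the standard binomial one and is an assumption about how the gate was computed, as is the helper name `mc_z`. The win count in the example is made up for illustration.

```python
import math

MARK_PATTERNS = 2_767_444
DARTS_LEFT = 3
SCORE_DIFFS = 501        # -250 .. +250 inclusive
BYTES_PER_VALUE = 4      # f32, per the fact register

states = MARK_PATTERNS * DARTS_LEFT * SCORE_DIFFS
size_gb = states * BYTES_PER_VALUE / 1e9   # decimal gigabytes

def mc_z(exact_p: float, wins: int, games: int) -> float:
    """z-score of a simulated win rate against the solved win probability."""
    se = math.sqrt(exact_p * (1.0 - exact_p) / games)
    return (wins / games - exact_p) / se

print(f"{states:,} states, {size_gb:.1f} GB")
# Hypothetical gate cell: solved value 0.50, 50,416 wins in 100,000 games.
print(f"z = {mc_z(0.50, 50_416, 100_000):.2f}")
```

With 20 states tested, a worst |z| in the mid-2s is within the range that sampling noise alone produces.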
The endgame audit
With the solve in hand, 13 strategies were audited: 6,000 games each (2,000 vs each of X188, CS_X181, and S2), with every in-slice decision (52,000 to 75,000 per participant) scored against the exact win probabilities. "Agreement" is tie-aware (a choice within 10⁻⁶ of the best move counts as agreeing); a "blunder" forfeits more than 0.05 win probability in one decision. Together with the unified ladder, this audit is the evidence base for the neural policies; there is no policy tournament grid.
| Rank | Participant | Lineage | Agreement | EV lost/game (95% CI) | Blunder rate |
|---|---|---|---|---|---|
| 1 | The Closer (it3) | RL | 74.0% | 0.0272 (0.0261–0.0283) | 0.38% |
| 2 | X123 | Clean slate | 58.0% | 0.0319 (0.0300–0.0339) | 1.15% |
| 3 | H1_Hoarder | Clean slate | 39.0% | 0.0369 (0.0342–0.0398) | 2.07% |
| 4 | The Shutout (X181) | Clean slate | 67.7% | 0.0400 (0.0379–0.0421) | 1.17% |
| 5 | X129 | Clean slate | 57.2% | 0.0428 (0.0407–0.0450) | 1.67% |
| 6 | X168 | Clean slate | 62.9% | 0.0430 (0.0406–0.0453) | 1.09% |
| 7 | PS | Baseline | 51.9% | 0.0475 (0.0436–0.0511) | 2.48% |
| 8 | X147 | Clean slate | 63.5% | 0.0497 (0.0472–0.0521) | 1.69% |
| 9 | X159 | Clean slate | 63.9% | 0.0507 (0.0484–0.0532) | 1.85% |
| 10 | The Shape Reader (X188) | Main | 61.6% | 0.0519 (0.0494–0.0547) | 3.42% |
| 11 | X109 | Main | 57.3% | 0.0583 (0.0556–0.0611) | 4.04% |
| 12 | S2 | Baseline | 43.8% | 0.0990 (0.0943–0.1040) | 4.63% |
| 13 | E12 | Main | 43.2% | 0.1039 (0.0990–0.1090) | 4.68% |
Shading: tablebase agreement 39% → 74%.
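The per-decision scoring rule behind these columns can be sketched as follows. This is a minimal illustration, assuming each decision comes with a mapping from candidate moves to exact win probabilities; the function name and the data layout are hypothetical. The example values are the X109 tie blunder from the fact register, with all other candidate moves omitted.

```python
TIE_EPS = 1e-6        # within this of the best move counts as agreeing
BLUNDER_LOSS = 0.05   # forfeiting more than this is a blunder

def score_decision(move_values: dict, played: str) -> dict:
    """Score one in-slice decision against exact win probabilities."""
    best = max(move_values.values())
    loss = best - move_values[played]
    return {
        "loss": loss,
        "agrees": loss < TIE_EPS,
        "blunder": loss > BLUNDER_LOSS,
    }

# X109 at a tied score with one dart: played double bull, optimal was S16.
result = score_decision({"DB": 0.403, "S16": 0.966}, played="DB")
print(f"forfeited {result['loss']:.3f}, blunder={result['blunder']}")
```

Summing `loss` over a participant's decisions and dividing by games played gives an EV-lost-per-game figure, which is why a strategy can disagree often yet leak little if its disagreements are near-ties.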
Read the ranking by EV lost, not by agreement: H1_Hoarder agrees with the solve only 39% of the time yet leaks less per game than The Shutout at 67.7% agreement, because most of its disagreements are near-ties. A second instrument from the same solve: hybrid matches, where a champion plays itself except that in-slice decisions are replaced by tablebase-greedy play (50,000 games each, ±0.44pp; plain-vs-plain is 50% by symmetry):
| Champion | Hybrid win rate | Endgame gain | Tablebase darts/game |
|---|---|---|---|
| The Shape Reader (X188) | 54.40% | +4.4pp | 7.5 |
| The Shutout (X181) | 53.54% | +3.5pp | 11.0 |
| The Closer (it3) | 52.76% | +2.8pp | 9.5 |
| X123 | 51.40% | +1.4pp | 9.5 |
Exploitability probes
How much can a dedicated adversary punish each strategy? For five frozen targets, a PPO best-response was trained against that target alone (warm-started from the champion) and evaluated at 2,000 games per mode, stalls counted as losses, 0 stalls observed. Lower is harder to exploit: the attack barely beats a coin flip against The Closer but wins 82% against E12.
| Target | Best-response WR (greedy) | (sampled) |
|---|---|---|
| The Closer (it3) | 53.2% | 53.0% |
| The Shutout (X181) | 59.2% | 57.3% |
| X129 | 61.1% | 60.8% |
| The Shape Reader (X188) | 72.1% | 69.7% |
| E12 | 82.1% | 81.2% |
Shading: best-response win rate 53% → 82%.
Data downloads
Two data files ship with this site, both copied verbatim from committed research artifacts on the freeze date. Everything a chapter claims about the ladder or the endgame audit can be recomputed from them.
| File | Contents | Provenance |
|---|---|---|
| ladder.json | Ladder metadata (protocol, port validation), the 27-entry rating table with standard errors, and all 311 raw match records (strategies, wins, games, engine, seed) | Verbatim copy of autoresearch_strategies/elo_matches.json, 2026-07-02, unmodified |
| endgame_audit.json | The 13-participant audit ranking (agreement, EV loss with 95% CI, blunder rate, breakdowns by darts/diff/openness) plus the 50,000-game hybrid match results | Field values assembled unmodified from tablebase/audit_out/*.json; lineage labels per tablebase/AUDIT.md |
The previous site's data trees are retained at their old paths for link stability; the chapters on this site transcribe only from the frozen facts sheet, not from legacy JSON.
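As a starting point for recomputation, here is a minimal sketch that totals wins and games per strategy from raw pairing records. The key names (`a`, `b`, `wins_a`, `games`) are assumptions made for illustration; the actual field names in ladder.json should be checked before use. The records shown are toy data, not project results.

```python
from collections import defaultdict

def per_strategy_totals(records):
    """Sum wins and games per strategy from raw pairing records.

    Each record is assumed to look like
    {"a": "A", "b": "B", "wins_a": 2400, "games": 4000}
    (hypothetical key names, not confirmed against ladder.json).
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for r in records:
        games[r["a"]] += r["games"]
        games[r["b"]] += r["games"]
        wins[r["a"]] += r["wins_a"]
        wins[r["b"]] += r["games"] - r["wins_a"]
    return {name: (wins[name], games[name]) for name in games}

# Toy records. With the real file, load it with json.load and pass in
# whichever list holds the match records.
toy_records = [
    {"a": "A", "b": "B", "wins_a": 2400, "games": 4000},
    {"a": "A", "b": "C", "wins_a": 900, "games": 1500},
]
for name, (w, g) in sorted(per_strategy_totals(toy_records).items()):
    print(f"{name}: {w}/{g} = {w / g:.1%}")
```

Totals computed this way should line up with the Games column of the catalog: a strategy that appears only in the C-engine round robin, for example, plays 22 opponents at 4,000 games each, or 88,000 games.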
The fact register
Every citable fact on this site, by stable ID. Chapter links land here (or on the catalog and audit rows above, which carry their own IDs). Statements are condensed; the frozen facts sheet in the repository holds the full wording and the source path for each. All results as of July 2, 2026.
Named artifacts
| Fact ID | Statement |
|---|---|
| F-NAME-SHUTOUT | The Shutout = X181 (clean-slate X181_RaceBull): best hand-coded strategy — race-gated banking, faucet denial, min-miss bull taps. |
| F-NAME-CLOSER | The Closer = it3 (it3_league_ckpt0600, PPO league iteration 3, greedy): strongest artifact in the project. |
| F-NAME-TEXTBOOK | The Textbook = DIST: interpretable rule-bot distilled from the PPO policy. |
| F-NAME-GRINDER | The Grinder = X214_UD (X214_UD_Brake4): underdog (outgunned-player) champion. |
| F-NAME-SHAPEREADER | The Shape Reader = X188 (X188_X186DetectorChaseDenial): 89-iteration main-lineage champion, the old site's hero. |
The unified ladder
| Fact ID | Statement |
|---|---|
| F-ELO-METHOD | 27 artifacts rated by Bradley–Terry MLE, anchor S1 = 1000, scale 400/ln 10; 253 C-engine pairings (4,000 games each, pro/pro, seed 42) + 58 Python pairings (1,500 games each, 0 stalls); C ports validated bit-identical, Python ports within 1.5pp; SEs from inverse Fisher information. |
| F-ELO-FIT | Model fit: mean \|observed − predicted\| edge error 5.08pp; max residual 16.53pp (S10 vs E1: observed 39.1%, predicted 55.6%). |
| F-ELO-RLTOP | The RL policies top the ladder; it3 clears X181 by ~47 Elo and beats it 60.0% head-to-head (1,500 games); it3 > it2_det 54.5%, it2_det > PPO_run2 52.9%. |
| F-ELO-LINEAGEGAP | Every clean-slate champion from X129 up (1135–1176) rates above main-lineage champion X188 (1118); X181 beats X188 54.6% (top-of-lineage gap ~58 Elo). |
| F-ELO-OVERFIT | Within the main lineage only X188 improved on X109 on the unified field; X112 (1071) and X165 (1073), both home-pool "keeps", rate ~40 Elo below X109 (1113) — overfit to a fixed frontier pool. |
| F-ELO-DIST-STORY | DIST (1085) recovers about a third of the Elo gap from the classic baselines to its teacher PPO_run2 (1206); beats all seven classic representatives it played (55.7–70.9%), loses ~62% to the X1xx champions. |
| F-ELO-H1-STORY | H1_Hoarder (1052), injected as a "weak" pool trainer, outrates every Frongello baseline. |
| F-ELO-CYCLES | Zero intransitive cycles clear a >52% bar (≈2.5σ at 4,000 games): the 27-strategy field is, to measurement precision, a total order. 30 near-cycles exist inside the >50% noise band; strongest: E11 → S1 → S14 → E11 (weakest edge 51.9%). |
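
An intransitivity scan of the kind F-ELO-CYCLES describes can be sketched as a brute-force pass over ordered triples. This is a minimal illustration, assuming a lookup of observed win rates keyed by (winner, loser) pairs; the function name and the toy data are hypothetical.

```python
from itertools import permutations

def find_cycles(win_rate, threshold=0.52):
    """Return 3-cycles a > b > c > a in which every edge clears `threshold`.

    win_rate[(a, b)] is the fraction of games a won against b.
    Missing pairs are treated as never clearing the bar.
    """
    players = sorted({p for pair in win_rate for p in pair})
    cycles = set()
    for a, b, c in permutations(players, 3):
        edges = [(a, b), (b, c), (c, a)]
        if all(win_rate.get(e, 0.0) > threshold for e in edges):
            # Store one canonical rotation so each cycle is counted once.
            cycles.add(min((a, b, c), (b, c, a), (c, a, b)))
    return sorted(cycles)

# Toy rock-paper-scissors field, not project data.
toy_rates = {
    ("R", "S"): 0.55, ("S", "R"): 0.45,
    ("S", "P"): 0.55, ("P", "S"): 0.45,
    ("P", "R"): 0.55, ("R", "P"): 0.45,
}
print(find_cycles(toy_rates))                  # one cycle clears 52%
print(find_cycles(toy_rates, threshold=0.60))  # none clears 60%
```

Lowering the threshold toward 50% is how near-cycles inside the noise band would show up in the same scan.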
Endgame audit
| Fact ID | Statement |
|---|---|
| F-TB-METHOD | 13 participants × 6,000 games each (2,000 vs each of X188, CS_X181, S2; pro/pro, alternating starters); every in-slice decision (openness ≤ 6) scored against exact win probabilities; agreement tie-aware (< 10⁻⁶); blunder = > 0.05 forfeited; 52k–75k decisions per participant. |
| F-TB-IT3 | The Closer (it3) has the best endgame of anything built: 74.0% agreement, 2.7pp/game leak (71,961 decisions), 0.38% blunder rate — half X188's leak, one-ninth its blunder rate, with no endgame-specific training. |
| F-TB-X188WEAK | X188 ranks 10/13 in the endgame slice, forfeiting 5.2 win-probability points per game (52,415 decisions, 3.42% blunder rate); every clean-slate champion beats it there. Its rating is carried by opening/midgame play. |
| F-TB-AHEAD | Agreement collapses when ahead: 52–62% for everyone (vs up to 90.2% when behind — it3's behind-agreement). Behind, chase heuristics coincide with optimal play; ahead, optimal play keeps scoring and denying while champions race to close. |
| F-TB-TIEBLUNDER | Largest catalogued single-decision forfeit: X109 gives up 0.563 win prob at a tied score with one dart (played double bull EV 0.403; optimal S16 EV 0.966 — the single wins because ties go to the closer). X188 has the identical blind spot (forfeits 0.542: played DB, optimal S15). |
| F-TB-MOTIF | Of 260 catalogued exemplar blunders (loss > 0.05, top-20 per participant): closed-when-should-score 188, scored-when-should-close 55, closed wrong lane 17. Dominant leak: throwing at one's own last open lane (usually the bull) when behind or level. |
| F-TB-AGREEMENT-VS-EV | Agreement % and EV lost measure different things: H1_Hoarder agrees only 39% yet loses less per game than X181 at 67.7% agreement — most of its disagreements are near-ties. Rank by EV, not match rate. |
The tablebase solve
| Fact ID | Statement |
|---|---|
| F-TBSOLVE-SIZE | Solved slice: openness ≤ 6, diff band ±250, pro/pro — 2,767,444 mark patterns × 3 darts × 501 diffs = 4.16 billion states, 16.6 GB of f32 values, solved shell-by-shell. |
| F-TBSOLVE-GATES | All validation gates passed: mechanical equivalence, analytic spot-checks, boundary-bracket re-solve of shells ≤ 4, and a Monte-Carlo gate (20 states × 100,000 games; worst \|z\| = 2.63, 0 fails). |
| F-TBSOLVE-SEEDBUG | One real bug found in validation: raw xoshiro seeding biased the first dart under small sequential seeds; fixed by splitmix64 hardening (commit 8b49948). |
Exploitability
| Fact ID | Statement |
|---|---|
| F-EXP-READING | Lower best-response win rate = harder to exploit. A PPO adversary trained solely against it3 reaches 53.2% — barely better than a coin flip; the same attack beats X188 72% of the time and E12 82%. (2,000 games/mode, stalls-as-losses, 0 stalls in all cells.) |
The RL arc
| Fact ID | Statement |
|---|---|
| F-RL-BASELINE | Best heuristic on the 11-bot hard pool: E12 at 55.13% mean (2,000 games/matchup, pro/pro). |
| F-RL-BC | BC clone of E12 (99.7% action accuracy): 55.10% greedy mean — imitation exactly recovers its teacher. |
| F-RL-RUN1 | PPO run 1 failed by reward hacking: refusing to close and hoarding points indefinitely; 13–34% of greedy games deadlocked; stalls-as-losses mean 46.29%. Fix: 900-dart cap, stall = loss. |
| F-RL-PPO | PPO run 2 (stochastic): 75.79% mean over the 11-bot pool (2,000 games/matchup, 0 stalls in 22,000 games, every matchup > 50%); seed replication 76.01%. Caveats: its argmax deadlocks, and strength is partly pool overfitting. |
| F-RL-HELDOUT | Vs held-out X123 (never in training): PPO 64.8%, BC clone 29.5% (2,000 games) — fine-tuning worth +35pp on a structurally novel denial opponent. |
| F-RL-IT2 | it2_det (step-penalty fine-tune): greedy mean 72.89%, min 57.65%, 0 stalls in 26,000 games (0 in 52,000 across two seeds); hoarding gone — median game vs S2 is 52 darts, was 88. |
| F-RL-IT3 | it3 (league self-play): 15-member league mean 70.41% greedy, min 52.45% (vs its parent it2), 0 stalls; floor gain +6.45pp on the 14-member league. Anchored Elo 1215.9 on the keep-rule protocol, consistent with the ladder's 1222.0 (Δ6 Elo ≈ 0.9pp). |
| F-RL-SKILLGRID | The champion transfers across skill without retraining: league mean 70.41% pro / 70.28% good / 66.56% amateur, 0 stalls, beats every member at every profile (500 games/matchup). |
| F-RL-PARETO | A 3×256 capacity jump moved the hard band +2–4pp but paid it back on the easy band — league mean flat over 5 evals. The ceiling is a hard/easy Pareto frontier (~70.4 league / ~1216 Elo for this policy class); extra exploration was net-negative (−1pp). |
| F-RL-INFERENCE | Opponent inference closed by measurement: features carry identity (37–38% top-1 over 15 classes vs 6.7% chance), the policy conditions on it (12.3% of greedy actions flip), yet league mean stays flat (70.41 / 70.9 / 70.4). |
| F-RL-BRENVELOPE | The per-opponent best-response envelope above the champion sums to ≈+1.8pp league mean (≈+13–14 Elo) under oracle assumptions — below the +2pp/+15-Elo keep bar. Measured: BR-vs-S14 +3.95pp, BR-vs-S16 +2.5pp, BR-vs-X129 +1.6pp. |
| F-RL-A2C | Historical A2C ceiling (Feb 2026): 49.3% stable WR vs 11 hard bots (v19, entropy floor 0.01). Superseded by the 2026-07 arc: 49.3% → 75.79% pool mean → 70.41% deterministic league champion. |
The clean-slate arc
| Fact ID | Statement |
|---|---|
| F-CS-WINTAP | The win-tap (6 lanes closed, lead ≥ 0: aim the last unclosed target, min-miss single when 1 away — a tie is a winning position) was found at iteration 3 (X102_WinTap). The main lineage never found the full mechanic in 115 entries; its absence is X188's biggest blunder class. |
| F-CS-FAUCET | Faucet denial (X110_FaucetShut: while covering at a crushing lead, shut the opponent's income lanes before claiming empty ones) jumped the champion 55.0% → 64.4% mean — +9.4pp in one step, the largest single mechanic found in the project. Signature: PS +17.9, S6 +15.7, E1 +13.9pp. |
| F-CS-TRAJ | Champion trajectory (home-pool mean/min at keep; pool grew 11 → 18): X109 55.0/49.6 → X123 66.0/56.6 → X129 → X147 65.7/53.9 → X159 65.1 → X168 64.9/49.3 → X181 64.40/50.44, the first 18/18 sweep (seed-7 re-bench 64.40/50.41). Home means are not comparable across pool sizes. |
| F-CS-XBENCH-X123 | Cross-lineage milestone 1: X123 on the main repo's expanded 14-bot pool scores 63.3% mean vs X188's 60.65% baseline (+2.66pp same-seed, 14/14 beaten); head-to-head vs X188: 55.8 / 52.4 / 50.0% (amateur/good/pro). |
| F-CS-XBENCH-X129 | Cross-lineage milestone 2: X129 scores 64.8% (+4.13pp vs X188, same seed), min 56.7%; head-to-head vs X188 57.5 / 56.2 / 55.6% — beats X188 at every skill profile. |
| F-CS-XBENCH-LATER | No committed fixed-reference cross-bench exists for X147/X168/X181 on the main pool; their superiority over X188 is evidenced by the unified ladder (1162/1173/1176 vs 1118) and the direct head-to-head (X181 beats X188 54.6%, 4,000 games). |
| F-CS-OVERFIT-ECHO | The clean lineage independently re-derived main-lineage findings (the mr≤9 dead-code result; the value-ordered lane-selection law) — basin convergence, not path artifact. |
| F-CS-UD | Underdog program champion: The Grinder (X214_UD) at 17.63% mean / 13.31% min underdog WR vs the X181-clone baseline's 15.31% — a 15% relative gain (seed-7: 17.65/13.22). Recalibration: brake at any deficit, farm cap 80 → 10; the farm/deny boundaries are skill-relative. |
| F-CS-BULLTAP | Min-miss single-aim bull taps (X179_BullTapSingle) worth +0.27pp with no losing matchup; composed into X181. Corrects the main lineage's double-at-bull tap doctrine. |
The Opus control arm
| Fact ID | Statement |
|---|---|
| F-OPUS-RESULT | Second clean-room arm (10-iteration budget, original 11-bot pool) ended at X107_PSChaseMild8: 55.2% mean / 51.7% min, beats all 11 opponents (E12 baseline: 54.6% mean). |
| F-OPUS-VS-FABLE | At the same iteration count the other arm was at ~55.0% on the same pool — comparable; it then ran 100+ more iterations and found faucet denial. X107 cross-benched on the main expanded pool: 50.9% mean, 30–39% head-to-head vs X188. |
| F-OPUS-PLATEAU | The arm's own reflection: threshold tuning saturated (single-peaked mean-vs-multiplier curve, best 55.2% at mult 8); further gains would need a genuinely new mechanic the session did not reach. |
The AlphaZero cell
| Fact ID | Statement |
|---|---|
| F-AZ-RESULT | Final artifact run2/value_iter1.pt with depth-3 batched-leaf expectimax: 65.6% mean over the 11-bot pool (22,000 games/config, SE ≈ 0.32pp); net-only 63.4%, depth-2 64.3%. |
| F-AZ-CEILING | It did not beat the search-only ceiling: shallow expectimax with a hand-rolled race-heuristic leaf scores 67.8% (depth 2, 500 games/bot). The value net re-derived, but did not surpass, an explicit darts-to-victory computation. |
| F-AZ-REPLAYBUG | Most consequential training decision: replay-buffer recency, worth +9pp. Run 1's buffer never evicted, so every epoch re-fit near-random iter-0 self-play; the ".55 plateau" was stale-data anchoring. An empty-buffer restart jumped the pool mean +7–12pp in one iteration. |
| F-AZ-ELO | No anchored Elo for the AZ artifact exists in any committed file; the cell is absent from the ladder's match data. Chapters state its 65.6% pool mean instead. |
The main lineage and the classics
| Fact ID | Statement |
|---|---|
| F-MAIN-X188 | X188 kept 2026-04-21 at entry 089 (detector-branch chase denial). Baseline on the expanded 14-bot pool: 60.65% mean / 51.8% min, sweeping 14/14. Champion through entry 115; the 2026-07-01 session produced zero keeps — the sweep saturated. |
| F-MAIN-BULLTAP | X188's tap doctrine aims HIT_DOUBLE at the bull, inherited from entry 003. Two independent lines proved it wrong: the clean-slate min-miss single bull tap (+0.27pp) and the tablebase blunder catalog (0.542 forfeited playing DB at a tie). |
| F-MAIN-DEADCODE | Entry 107 proved X188's mr≤9 endgame-gate clause is dead code; entry 115 proved its tie-chase is unreachable (> vs ≥ bit-identical). |
| F-MAIN-LINEAGEVALUE | Cross-bench of the Opus arm's 10-iteration champion on the expanded pool (50.9% mean, 30–39% vs X188) quantifies 89 iterations of lineage at ~10pp of mean. |
| F-CLASSIC-E12 | E12 (FinishOppClosed1) was the best classic bot: #1 of the 30 pre-X strategies, seed of the main lineage (X103) and the RL arc's BC teacher. Unified ladder: 1018 (mid-table); most exploitable audited artifact (82.1%). |
| F-CLASSIC-SBOTS | Of the 17 Frongello S-bots: S2 (score/cover at lead 0) is the strongest at every MPR level of the sweep; S14/S10 (chase variants) top the S-family on the ladder (1037/1032 vs S2's 1014); S1 (pure closer, anchor) and S6 (949.5, extra darts) bracket the bottom. |
| F-MPR-SWEEP | X188 MPR sweep (31 strategies, 20,000 games/matchup, equal skill, 11 MPR levels 0.8–5.6): X188 ranks #1 at all 11 levels, average WR 59.0% → 68.0%; E12 #2 at all 11 (57.8–59.5%); best S-bot S2 (57.4–59.2%). Scope caveat: within its 31-strategy field only — the sweep predates the clean-slate and RL artifacts, which beat X188. The old site's "never loses a head-to-head" claim is falsified on the unified field. |
Timeline
| Fact ID | Statement |
|---|---|
| F-TIME-SESSIONS | 36 research sessions, 2026-02-05 → 2026-07-02. By month: Feb 23, Mar 4, Apr 4, Jul 5 — the project slept ~10 weeks between the X188 keep and the clean-slate/RL/tablebase burst. |
| F-TIME-BIRTH | Project birth 2026-02-05 (the Q-learning/A2C era); first public commit 2026-03-19. |
| F-TIME-AUTORESEARCH | The autoresearch loop was born 2026-04-20; X188 was kept 2026-04-21 — 89 entries in two days. |
| F-TIME-CLEANSLATE | 2026-07-01: the clean-slate arm (114 entries), the Opus control arm, and the RL arc's BC→PPO run all started the same day. 2026-07-02: unified ladder, tablebase solve + audit, exploitability audit, underdog program — and the freeze. |
| F-TIME-FREEZE | Site facts frozen as of July 2, 2026. Every champion claim on the site carries this stamp. |