One Ladder — Darts Cricket Research

Everything on one scale, at last

Each earlier chapter measured its champions against a different yardstick. The catalog champion was ranked #1 within a field of 31 hand-built strategies, where The Shape Reader (X188) posted a 60.65% mean on its expanded fourteen-bot pool; the rerun tracked a home-pool mean that grew as its opponent pool grew; the learned policy was scored against an eleven-bot training pool and then a 70.41% self-play league mean. Those numbers are not comparable — a pool mean, a home-pool mean, and a league mean each answer a different question. To say which artifact is actually strongest, they all have to answer the same one.

So we built a single tournament. Twenty-seven artifacts (the best of the main lineage, the clean-slate rerun, the RL arc, and the classic Frongello and experimental baselines) were rated by Bradley–Terry maximum likelihood on 253 full-round-robin C-engine pairings (4,000 games each) plus 58 Python-engine pairings (1,500 games each), anchored at S1 = 1000. Cross-engine comparison is only honest if the engines agree, so the clean-slate champions were ported into the C engine and validated bit-identical to their home engine, and the Python ports were held to within 1.5pp of C against the S2 baseline. One number, one meaning, for every artifact in the project.
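The rating step can be sketched in a few lines. This is a minimal illustration, not the project's fitting code: it assumes the standard minorize-maximize update for Bradley–Terry and the conventional 400-point logistic Elo scale, neither of which is stated in the text. The function name `fit_bradley_terry` and the win tallies for A, B, and C are made up for the example.

```python
import math

def fit_bradley_terry(wins, anchor, anchor_elo=1000.0, max_iters=10_000, tol=1e-12):
    """Fit Bradley-Terry strengths by maximum likelihood and report them as Elo.

    wins[(a, b)] is the number of games a won against b (no draws assumed).
    Every player needs at least one win and one loss for the MLE to be finite.
    """
    players = sorted({p for pair in wins for p in pair})
    total_wins = {p: 0.0 for p in players}
    games = {}  # unordered pair -> games played between the two
    for (a, b), w in wins.items():
        total_wins[a] += w
        pair = frozenset((a, b))
        games[pair] = games.get(pair, 0.0) + w

    strength = {p: 1.0 for p in players}
    for _ in range(max_iters):
        updated = {}
        for p in players:
            # MM update: total wins divided by the sum over opponents q
            # of games(p, q) / (strength_p + strength_q).
            denom = sum(
                n / (strength[p] + strength[q])
                for pair, n in games.items() if p in pair
                for q in pair - {p}
            )
            updated[p] = total_wins[p] / denom
        # Strengths are only defined up to scale; pin the anchor at 1.
        scale = updated[anchor]
        updated = {p: s / scale for p, s in updated.items()}
        change = max(abs(updated[p] - strength[p]) for p in players)
        strength = updated
        if change < tol:
            break

    # Assumed conversion: 400 Elo per factor of 10 in strength.
    return {p: anchor_elo + 400.0 * math.log10(strength[p]) for p in players}


# Made-up tallies (100 games per pairing), for illustration only.
demo_wins = {
    ("A", "B"): 60, ("B", "A"): 40,
    ("B", "C"): 60, ("C", "B"): 40,
    ("A", "C"): 70, ("C", "A"): 30,
}
demo_elo = fit_bradley_terry(demo_wins, anchor="C")
for name in sorted(demo_elo, key=demo_elo.get, reverse=True):
    print(f"{name}: {demo_elo[name]:.1f}")
```

Pinning one strategy (here C, in the article S1) removes the free scale factor, which is why every other rating can be read as a distance from the anchor.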

The whole project, ranked

As of July 2, 2026, the strongest artifact in the project is The Closer (it3), the PPO league policy, at Elo 1222 (±2.8, 24,000 games). The three RL policies take the top three rungs; the best hand-coded strategy, The Shutout (X181), sits fourth at 1176; and the old site's hero, The Shape Reader (X188), ranks ninth at 1118: #1 in its own 31-strategy field, mid-pack once the newer lineages arrive.

The full ladder is below. Heat encodes Elo magnitude (darker is higher), and the lineage column shows how the three families interleave: RL on top, then clean-slate champions, then the main lineage and the classic bots. The margins are real but not enormous: The Closer clears the best hand-coded champion by about 47 Elo, and beats The Shutout 60.0% head-to-head over 1,500 games.
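To put an Elo gap of that size in win-rate terms, the worked conversion below assumes the ladder uses the conventional 400-point logistic Elo scale, which the text does not state. It takes the rounded table values for The Closer (1222) and The Shutout (1176), so the gap comes out at 46 rather than the unrounded figure.

```latex
P(\text{win}) = \frac{1}{1 + 10^{-\Delta/400}},
\qquad
\Delta = 1222 - 1176 = 46
\;\Longrightarrow\;
P \approx \frac{1}{1 + 10^{-0.115}} \approx 0.566
```

Under that assumption, a gap in the mid-40s corresponds to winning a little under 57% of games, which is the sense in which the margin is real but not enormous.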

Every rated strategy as a dot at its Elo with a ±standard-error whisker, best at top; the lineage swatch at each row shows how the three families interleave. The Closer (it3) leads at 1222. The same values are tabulated below.

Unified Elo ladder of 27 strategy artifacts across three lineages, Bradley–Terry MLE anchored at S1 = 1000, from 253 C-engine pairings of 4,000 games and 58 Python-engine pairings of 1,500 games. Elo with standard error and total games per artifact. Heat encodes Elo magnitude; darker is higher.
#	Strategy	Lineage	Elo	±SE	Games
1	The Closer (it3)	RL (PPO)	1222	2.8	24,000
2	IT2_det	RL (PPO)	1216	2.7	24,000
3	PPO_run2	RL (PPO)	1206	2.7	24,000
4	The Shutout (X181)	clean-slate	1176	1.7	94,000
5	X168	clean-slate	1173	1.7	94,000
6	X159	clean-slate	1167	1.7	94,000
7	X147	clean-slate	1162	1.7	94,000
8	X129	clean-slate	1135	1.7	94,000
9	The Shape Reader (X188)	main	1118	1.7	88,000
10	X109	main	1113	1.7	88,000
11	X123	clean-slate	1107	1.6	94,000
12	The Textbook (DIST)	RL (distilled)	1085	2.6	24,000
13	X165	main	1073	1.7	88,000
14	X112	main	1071	1.7	88,000
15	H1_Hoarder	clean-slate	1052	1.7	88,000
16	S14	classic	1037	1.6	94,000
17	S10	classic	1032	1.6	94,000
18	E12	classic	1018	1.6	94,000
19	X103	main	1018	1.7	88,000
20	E10	classic	1014	1.7	88,000
21	S2	classic	1014	1.6	94,000
22	E3	classic	1012	1.6	94,000
23	S1 (anchor)	classic	1000	0.0	88,000
24	E1	classic	992	1.7	94,000
25	E11	classic	990	1.7	88,000
26	PS	baseline	987	1.7	88,000
27	S6	classic	950	1.7	94,000

Shading: Elo rating 950 → 1222.

The ladder settles three questions the earlier chapters could only pose. The RL policies are internally consistent: The Closer beats IT2_det 54.5%, which beats PPO_run2 52.9%, matching their order on the scale. The clean-slate rerun genuinely surpassed the main lineage: every clean-slate champion from X129 upward (1135–1176) rates above The Shape Reader (1118), and The Shutout beats it 54.6% head-to-head. The earlier +2.66pp spot check that launched the rerun chapter was, it turns out, only the tip of a roughly 58-Elo gap. And The Textbook (DIST), the interpretable distillation, lands at 1085, recovering about a third of the Elo distance from the baselines to its teacher, beating all seven baseline representatives it faced but losing roughly 62% of its games to the X-series champions.

The ladder also caught something a home-pool mean would have hidden. Within the main lineage, only The Shape Reader actually improved on its predecessor X109 on the unified field: X112 (1071) and X165 (1073), both logged as "keeps" against their home frontier pool, rate about 40 Elo below X109 (1113). Tuning against a fixed pool made them better at beating that pool and quietly worse at the game: overfitting, visible only once a wider field was there to expose it.

A total order: no strategy beats the strategy that beats it

A ladder is only meaningful if the game supports one. Rock–paper–scissors has no ladder: every strategy beats one rival and loses to another, so any linear ranking is a lie. The technical question is whether the strategy space contains such intransitive cycles. If A reliably beats B, B reliably beats C, and C reliably beats A, no single scale can be honest. So before trusting the ladder, we scanned for cycles.

Over every ordered triple of the 27 artifacts, requiring each edge to clear 52% (roughly 2.5σ at 4,000 games, so a cycle has to be signal, not noise), the result is unambiguous.

Key finding

Zero intransitive cycles clear the 52% bar. To measurement precision, the 27-strategy field is a total order. Drop the bar to a bare 50% and 30 cycles appear, but all inside the noise: the strongest near-cycle, E11 → S1 → S14 → E11, has a weakest edge of just 51.9%. Intransitivity, where it exists at all, is confined to a ±2pp band around even.
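A scan of this kind can be sketched as follows. It is an illustration under stated assumptions, not the project's scanner: the function names `coin_flip_sigma` and `find_three_cycles` are invented here, the three bots A, B, and C and their win rates are made up, and "clearing" the bar is taken to mean strictly exceeding it.

```python
import math
from itertools import permutations

def coin_flip_sigma(games):
    """Standard deviation of an observed win rate over `games` even-odds games."""
    return math.sqrt(0.25 / games)

def find_three_cycles(win_rate, bar=0.52):
    """Return every directed 3-cycle a -> b -> c -> a whose edges all exceed `bar`.

    win_rate[(a, b)] is the fraction of head-to-head games that a won against b.
    Each cycle is reported once, starting from its alphabetically first member.
    """
    players = sorted({p for pair in win_rate for p in pair})
    cycles = []
    for a, b, c in permutations(players, 3):
        if a > b or a > c:
            continue  # a rotation of a cycle that starts from a smaller name
        edges = [win_rate.get(e, 0.0) for e in ((a, b), (b, c), (c, a))]
        if min(edges) > bar:
            cycles.append(((a, b, c), min(edges)))
    return cycles

# Made-up rates for three hypothetical bots; each pair is stored in both directions.
observed = {("A", "B"): 0.519, ("B", "C"): 0.550, ("C", "A"): 0.560}
rates = dict(observed)
rates.update({(b, a): 1.0 - p for (a, b), p in observed.items()})

print(0.02 / coin_flip_sigma(4000))        # size of a 2pp edge in sigmas at 4,000 games
print(find_three_cycles(rates, bar=0.52))  # the strict bar
print(find_three_cycles(rates, bar=0.50))  # the bare-majority bar
```

In this toy data the loop A → B → C → A only appears once the bar drops to 50%, because its weakest edge sits just under 52%. That is the same pattern the scan reports for the real field.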

This is a genuine property of the game, not an artifact of the rating method. Cricket rewards the same things regardless of who is playing (deny the opponent's income, close in value order, respect the tie rule), so being better at those things helps against everyone, and skill lands on a line rather than a wheel. The Bradley–Terry model fits that line well: mean absolute edge error 5.08pp, worst residual 16.53pp. The residual that survives lives in exactly the ±2pp band the cycle scan flagged, a thin film of style-matchup noise floating on top of a real skill ordering. The ladder is not a convenient fiction; the game earned it.
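An edge error is the gap between a measured head-to-head rate and the rate the fitted ratings imply. The sketch below shows the calculation on the four head-to-head results quoted in this chapter, using the rounded Elo values from the ladder table. It assumes the conventional 400-point logistic Elo scale, and `expected_win_rate` is a name invented for the example. Because it covers only four edges and rounded ratings, its output is not expected to reproduce the project-wide figures, which are computed over every pairing.

```python
def expected_win_rate(elo_a, elo_b):
    """Win probability for a over b implied by an Elo gap (assumed 400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))

# Ratings from the ladder table and head-to-head rates quoted in the text.
elo = {"it3": 1222, "IT2_det": 1216, "PPO_run2": 1206, "X181": 1176, "X188": 1118}
observed = {
    ("it3", "IT2_det"): 0.545,
    ("IT2_det", "PPO_run2"): 0.529,
    ("it3", "X181"): 0.600,
    ("X181", "X188"): 0.546,
}

# Residual = observed rate minus the rate the ratings predict for that edge.
residuals = {
    pair: rate - expected_win_rate(elo[pair[0]], elo[pair[1]])
    for pair, rate in observed.items()
}
mean_abs_error = sum(abs(r) for r in residuals.values()) / len(residuals)
worst_residual = max(abs(r) for r in residuals.values())

for (a, b), r in residuals.items():
    print(f"{a} vs {b}: residual {100 * r:+.2f}pp")
print(f"mean absolute edge error: {100 * mean_abs_error:.2f}pp")
print(f"worst residual: {100 * worst_residual:.2f}pp")
```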

Two more lenses, one verdict

A single measurement, however careful, can encode its own blind spot. The convincing thing about The Closer's position is that two independent audits, built for different purposes and run on different data, rank it the same way. The first is the endgame tablebase audit from the previous chapter: scored against an exact solve of the cricket endgame, The Closer has the best endgame in the project, agreeing with perfect play on 74.0% of its decisions and leaking 2.7 win-probability points per game (71,961 decisions), while The Shape Reader ranks 10th of 13 there, forfeiting 5.2 points per game. The ladder and the tablebase never shared a metric, yet they agree on both the champion and the laggard.

The second lens is adversarial. Elo asks how an artifact does against the field as it stands; exploitability asks the harder question: how it does against an opponent trained specifically to beat it. For each target we warm-started a PPO best-response from the champion and let it train only to exploit that one frozen strategy, then scored it over 2,000 games. Lower is better: a strategy that a dedicated adversary can only narrowly beat has no exploitable hole.

Exploitability audit: greedy win rate of a PPO best-response adversary trained specifically against each frozen target, 2,000 games per target. Lower is harder to exploit. Heat encodes the best-response win rate; darker is more exploitable.
Target	Best-response win rate (greedy)	Best-response win rate (sampled)
The Closer (it3)	53.2%	53.0%
The Shutout (X181)	59.2%	57.3%
X129	61.1%	60.8%
The Shape Reader (X188)	72.1%	69.7%
E12	82.1%	81.2%

Shading: best-response win rate 53% → 82%.

The ranking is the ladder's, again. A dedicated adversary trained only to beat The Closer reaches barely better than a coin flip: 53.2%. The same attack beats The Shape Reader 72% of the time and the classic bot E12 82%. And the finding carries a pointed irony for the old hero. The Shape Reader's edge was its detector branches, machinery for recognizing an opponent's pattern and chasing it. But a detector is a rule, and a rule can be baited: the very patches that made The Shape Reader read the board became the attack surface an adversary could steer. The Closer has no such lever to pull, because it never committed to a fixed rule that could be exploited. Three lenses (Elo against the field, agreement with exact play, resistance to a tailored adversary) and one verdict, as of July 2, 2026.
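One way to read "barely better than a coin flip" is to ask how many standard errors each best-response win rate sits above 50%. The sketch below does that for the greedy column of the exploitability table. It assumes the 2,000 games per target are independent and uses a normal approximation; the function name `sigmas_above_even` is invented for the example.

```python
import math

def sigmas_above_even(win_rate, games):
    """z-score of an observed win rate against a fair coin (normal approximation)."""
    return (win_rate - 0.5) / math.sqrt(0.25 / games)

# Greedy best-response win rates from the exploitability table, 2,000 games each.
greedy = {
    "The Closer (it3)": 0.532,
    "The Shutout (X181)": 0.592,
    "X129": 0.611,
    "The Shape Reader (X188)": 0.721,
    "E12": 0.821,
}
z_scores = {name: sigmas_above_even(rate, 2000) for name, rate in greedy.items()}
for name, z in z_scores.items():
    print(f"{name}: {z:.1f} sigma above even")
```

On these assumptions the 53.2% against The Closer works out to a little under three standard errors above even: a small edge for the adversary, and far smaller than the one it gains against any other target.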

Five months, thirty-six sessions

With the measurements settled, the last thing worth ranking is the project itself. The work leaves a paper trail: 36 working sessions were harvested between February 5 and July 2, 2026, and their monthly distribution tells the real story of how this research was actually done: in two bursts separated by a long silence.

Research sessions per month, February–July 2026. Two bursts — a 23-session February and a five-session July — bracket a ten-week silence where May and June register zero.

Working sessions per month, February through July 2026, from harvested session summaries (36 sessions total), with the milestone that defined each month.
Month	Sessions	What happened
February	23	Project birth (Feb 5); the Q-learning and A2C era
March	4	First public release of the research repo (Mar 19)
April	4	Autoresearch loop born (Apr 20); The Shape Reader kept (Apr 21)
May	0	The project slept — ten weeks dormant
June	0	Still dormant
July	5	Clean-slate day (Jul 1): rerun, RL, tablebase; the freeze (Jul 2)

The shape is stark. February opened the project with 23 sessions (the Q-learning and A2C experiments), and the first public commit landed March 19. April is where the machine turned on: the autoresearch loop was born on April 20, and The Shape Reader was kept the very next day, April 21, as entry 89 of a journal that reached 89 entries in two days. Then nothing for ten weeks: zero sessions in May or June, while The Shape Reader sat as the unchallenged champion. The entire modern arc of this project (the rerun, the RL policies, the exact tablebase, the exploitability audit, the underdog specialist, and the unified ladder itself) is compressed into two days: the clean-slate day of July 1 and the audit-and-freeze of July 2.

That compression is the point. Ten weeks of the champion sitting untouched did not make it stronger; two days of adversarial re-examination found eight artifacts that rate above it on the unified ladder. The gap between the two bursts is the most honest thing in the timeline: it marks the difference between having an answer and having pressure-tested it.

Three times, the same wall

The most reassuring pattern in the record is convergence: separate arms of the project, working in isolation, kept arriving at the same conclusions. When three independent searches hit the same wall, the wall is probably real.

The first convergence was on denial: the discovery that the biggest lever in cricket is not scoring faster but starving the opponent's income. The clean-slate rerun found it as "faucet denial," worth +9.4pp in a single step, "the largest single mechanic found in this entire project". The RL policy learned the same instinct from reward alone, converting it into the tablebase's dominant optimal motif: score on the opponent's open lane before closing your own. And the underdog program made denial its foundation: its specialist, The Grinder (X214_UD), learned to deny at any deficit. Three arms, one lever.

The second convergence was on opponent memory, and its limits. The Shape Reader's detector branches, the RL arc's opponent-inference experiments, and the exploitability probes all circled the same question: how much is it worth to model who you're playing? The RL arc answered it by measurement, and the answer was humbling. The policy's features do carry opponent identity (37–38% top-1 over 15 classes against 6.7% chance), and it does learn to condition on them (12.3% of its greedy actions flip by opponent), yet its league mean stayed flat. The full best-response envelope above the champion's fixed mixture summed to only about +1.8pp of league mean (roughly +13–14 Elo), below the keep bar. Reading the opponent is real, and against a strong general policy it is nearly worthless. The Shape Reader's detectors, the same machinery, are exactly what made it the most exploitable of the modern champions.

The convergences are why the total order holds. If reading opponents paid off richly, the field would be full of counters-to-counters and the ladder would buckle into cycles. It stays a line because the winning skills — denial, value-order closing, tie discipline — are the same against everyone.

What the project taught

Three lessons survived every arm of the work, and they are less about darts than about how to run a search for a good answer.

Journals beat memory. The main lineage reached 115 journal entries and the clean-slate rerun 114, each entry a dated, reproducible record with its keep bar and its benchmark. That discipline is what let a rerun months later independently re-derive main-lineage findings (the dead-code endgame gate, the value-ordered lane-selection law) and recognize them as basin convergence rather than coincidence. A result you cannot point to a journal entry for is a result you cannot trust twice.

Adversarial verification beats confidence. The Shape Reader looked unbeatable: #1 at all 11 skill levels of its sweep. That confidence was an artifact of the field it was tested in. Every hard number that dislodged it came from a lens built to attack, not to confirm: the exact tablebase, the trained best-response probes, the unified ladder that finally included stronger opponents. Confidence measured against a friendly field is just the field's shape reflected back.

Burn your history periodically. The single most productive decision of the July burst was to throw the accumulated lineage away and start clean. The context-isolated rerun, carrying none of the main lineage's 89 entries of habit, found the tie-rule "win-tap" at iteration 3, a mechanic the main lineage never found in 115 entries, and the tablebase later certified that exact omission as The Shape Reader's biggest blunder class. Accumulated history is momentum, and momentum has a direction; sometimes the only way to find the turn you have been missing is to forget the road that brought you here.

Where the story ends

The research is frozen as of July 2, 2026. The champion is The Closer (it3) at Elo 1222: strongest on the unified ladder, best against the exact endgame, and the hardest of all audited artifacts to exploit. What the project knows about playing cricket well is written down in the rule card: the champion's doctrine, distilled into advice a person can follow.