A project in two bursts, ten weeks apart
Every other chapter argues about strategies. This one is about the work itself. The question here is procedural: over five calendar months, in what order did the ideas arrive, and by whose hand? The record is unusually complete, because the work left two parallel trails: a git history across five repositories, and 36 harvested session summaries spanning February 5 to July 2, 2026. Read together, they tell a story the win-rate tables cannot. The project ran in two intense bursts separated by a ten-week dormancy, and the second burst, two days in July, produced almost everything that survived to the freeze.
The shape of five months
February opened the project with 23 sessions of hand-built Q-learning and A2C experiments. April turned on the autoresearch loop and produced The Shape Reader (X188) in two days. Then nothing for ten weeks, with zero sessions and zero commits across all five repositories in May and June, until the entire modern arc (the rerun, the RL policies, the exact tablebase, the unified ladder) was compressed into two days: July 1 and July 2, 2026.
One caveat before the log. The raw session transcripts from the February–April era were later lost to a routine 30-day cleanup; what survives from that period is the set of harvested summaries, one dated paragraph per session, distilled at the time. So the early entries below are reconstructed from those summaries plus the git history, and the July entries from both plus the still-live research journals. Where a summary or a commit names the model that did the work, this page cites it; where it does not, the entry says "model: not recorded" rather than guessing. That distinction is the honest core of the "which models" question, and it is worth stating plainly: the era of per-commit model attribution had not yet begun in February.
Which model did which era
Before the log itself, the model dimension deserves its own table, because it is the part most likely to be over-claimed. The clearest, best-documented facts are at the two ends: the project was born on February 5 in a session that deliberately enabled Claude Code's experimental Agent Teams feature, and the entire July burst was Fable-directed, with an Opus control arm run in parallel as a deliberate head-to-head. In between, the record thins. The April autoresearch and site-audit commits carry a Claude Opus 4.7 (1M context) co-author trailer; the February and March commits predate that convention entirely.
| Era | Dates | Model(s) | Attribution basis |
|---|---|---|---|
| Greenfield & the RL saga | Feb 5–19 | not recorded | Predates per-commit model trailers; Agent Teams enabled Feb 5 |
| MCTS pivot & tournaments | Mar 9–20 | not recorded | No model trailer on March commits |
| Autoresearch loop & X188 | Apr 17–21 | Opus 4.7 (1M context) | Co-author trailer on the Apr 19–21 commits |
| Dormancy | May–Jun | — | No sessions, no commits in any repository |
| Clean-slate day: treatment arm | Jul 1–2 | Fable 5 | Co-author trailer on the fable-cleanslate & main-repo commits |
| Clean-slate day: control arm | Jul 1 | Opus | Explicit control arm (see The Rerun); science commits carry no trailer |
| RL, AlphaZero, tablebase, ladder | Jul 1–2 | Fable 5 | Co-author trailer on the RL and main-repo commits |
The one genuinely designed model comparison is the July rerun. On July 1 the discovery loop was restarted from scratch in two isolated arms: a treatment arm driven by Fable 5, allowed to run long, and an Opus control arm capped at ten iterations. The control arm reached a 55.2% champion (X107); at the same iteration count the Fable arm stood at a comparable ~55.0% (X109), before running 100-plus more iterations to find faucet denial and, finally, The Shutout (X181). Because the two arms differ in both model and budget, this is not a clean model benchmark; it is two independent replays of the same search, and the full comparison lives in The Rerun. What the timeline adds is simply that both ran on the same day, side by side.
The log
The chronology below groups the 36 sessions and five repositories' commits into six eras. Each entry states what was tried, what came of it, and the recorded model. Dates are from the harvested session summaries and git history; where the two disagreed on a founding date, the note resolves it.
-
Greenfield & the reinforcement-learning saga
February 5–19, 2026 · 23 sessions
-
Feb 5
Project born; the game built by a team of agents.
From a PRD and a fresh CLAUDE.md, a team of three parallel agents built the complete cricket engine (12 tasks, 34 tests) in one day, then added a skill-level constraint system and the plan to validate the AI against Frongello's published strategies. The session's own lesson: agents self-organize to unblock a dependency graph when tasks are formally tracked. This is the project's birthday.
model: not recorded (Agent Teams enabled) -
Feb 7–8
Two problems surface: a 0% second-mover and a broken benchmark.
The A2C agent won 0% of games when moving second; the fix expanded the state from 15 to 19 dimensions and replaced frozen self-play with rotation against a bot pool. In parallel, the team found their simulator had only 4 of Frongello's 17 strategies and that perfect-accuracy testing was masking every real difference — so they committed to rebuilding all 17 from the paper.
model: not recorded -
Feb 8–9
The scoring-symmetry bug, and a skill model grounded in real data.
Uniform ~50% win rates across matchups turned out to be a game mechanic, not a strategy bug: when both players close in identical order, no one ever scores. Alongside the fix, a per-player SkillProfile replaced the global miss flag with professional accuracy data (Haugh & Wang 2024), and the pro-profile A2C agent promptly collapsed to a 15.6% win rate under the noisier gradient signal.
model: not recorded -
Feb 9–11
The site ships; the MPR bug; the bull-difficulty finding.
A 22×22 tournament runner and the first research site went up on Cloudflare Pages at darts.mattalldian.com, and an interactive advisor ported 22 Python decision trees to client-side JS. A capped MPR calculation was found and fixed (realistic range 0.76–5.29), and a bull-difficulty multiplier (0.75, professionals hit the bull ~25% less) reshuffled the field: E1 (EarlyBull) collapsed and S9 emerged as the real winner.
model: not recorded -
Feb 18–19
AlphaZero, first attempt — and a flat failure.
A long tournament was restarted on the Windows homelab (20 cores, ~22-hour runtime), and an AlphaZero player with stochastic MCTS was wired into the validation pipeline. Tested against 28 bots it won 0% across the board — conclusively under-trained — and the session pivoted toward MuZero as the next idea. The learned-search thread would not pay off until July.
model: not recorded
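The bull-difficulty multiplier from the Feb 9–11 entry is easy to picture as code. The following is a minimal sketch, not the project's implementation: the function name, the target labels, and the 0.80 base accuracy are assumptions made for illustration, and only the 0.75 multiplier comes from the log.

```python
# Hypothetical names throughout; only the 0.75 multiplier is taken from the log.
BULL_DIFFICULTY = 0.75  # the bull is hit ~25% less often than a numbered target


def hit_probability(base_accuracy: float, target: str) -> float:
    """Per-dart hit probability, scaled down when the target is the bull."""
    if not 0.0 <= base_accuracy <= 1.0:
        raise ValueError("base_accuracy must be a probability in [0, 1]")
    if target == "BULL":
        return base_accuracy * BULL_DIFFICULTY
    return base_accuracy


if __name__ == "__main__":
    # Illustrative accuracy only; real profiles would come from measured data.
    for target in ("20", "BULL"):
        print(target, round(hit_probability(0.80, target), 3))
```

Under this assumption, a strategy that opens on the bull pays the penalty on its very first target, which is one plausible reading of why an early-bull strategy would lose ground once the multiplier was applied.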
-
Consolidation: the MCTS pivot
March 9–20, 2026 · 4 sessions
-
Mar 19
First public release of the research repo.
The main repository's first commit, "Initial release: Darts Cricket Strategy Research", landed on March 19, six weeks after the project itself began. The birthday and the first public commit are different dates, and the log keeps them distinct.
model: not recorded -
Mar 9–11
Why A2C struggled — and a 37× jump from MCTS.
A diagnostic session concluded that A2C was poisoned by weak bots and too slow to iterate, and that MCTS with a hand-crafted evaluation function fit the fast-experiment loop far better. An autonomous MCTS improvement loop then ran 46 experiments and lifted the win rate from 1.3% to 48% — a 37× gain — and established that hand-tuned heuristic leaves beat generic neural evaluators, and that the ceiling was strategic, not accuracy-limited.
model: not recorded -
Mar 12–20
A 12-billion-game tournament, and an advisor audit.
A large stratified tournament (~12B games) fed the interactive advisor, and a subagent audit caught the subtle failure modes of aggregated data — spread-vs-shift confusion, survivorship bias, and MPR-to-multiplier mapping errors — before publication. The month closed with the analysis infrastructure solid but the strategy frontier still hand-driven.
model: not recorded
-
The autoresearch harness is born
April 17–21, 2026 · 4 sessions
-
Apr 17–18
A full design-audit pass on the site.
Before the science accelerated, the site got a disciplined, batch-by-batch quality pass — normalize, typeset, harden, animate, arrange, polish, clarify — migrating the palette to design tokens, adopting a fluid type scale, and closing accessibility findings (contrast, focus rings, 44px touch targets, ARIA). Commit-per-batch, verified at each step.
model: Opus 4.7 (1M context) -
Apr 20
The autoresearch loop turns on.
The closed-loop strategy-discovery harness was built: run_bench.py generates candidate strategies, benches them against a fixed 11-bot pool, and feeds the results back to an LLM agent with a persistent journal. The loop was born April 20: its first journal entries and scaffold commit are dated the 20th. A C engine, validated within 1.80pp of the Python reference, made the fast benching possible. Overnight it reached a 62.1% mean (X165) against the original pool, up from E12's 54.6%.
model: Opus 4.7 (1M context) -
Apr 21
The Shape Reader kept — 89 journal entries in two days.
Morning re-benching on an expanded pool exposed hidden regressions (X165 dropped to 59.2% and lost to the earlier X109), and a minimalist redesign produced the champion: The Shape Reader (X188), which is X109 plus an S1 detector plus a narrow finish mechanic. It was kept on April 21 as journal entry 89, making 89 entries in two days. It would stand unchallenged for ten weeks.
model: Opus 4.7 (1M context)
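The propose, bench, keep-or-discard, journal cycle described in the Apr 20 entry can be sketched in a few lines. This is a toy under stated assumptions, not the contents of run_bench.py: every name here is hypothetical, a candidate is reduced to a single "strength" number, the game model is a simple ratio of strengths, and the `propose` callback stands in for the LLM agent that the real loop consults.

```python
import random
from typing import Callable, Dict, List, Tuple

# (label, hidden strength): a toy stand-in for a real strategy.
Candidate = Tuple[str, float]


def bench(strength: float, pool: List[float], games: int, rng: random.Random) -> float:
    """Mean win rate across a fixed opponent pool, using a toy game model."""
    rates = []
    for opponent in pool:
        p_win = strength / (strength + opponent)
        wins = sum(rng.random() < p_win for _ in range(games))
        rates.append(wins / games)
    return sum(rates) / len(rates)


def discovery_loop(
    propose: Callable[[int, Candidate, random.Random], Candidate],
    pool: List[float],
    champion: Candidate,
    iterations: int,
    games: int = 500,
    keep_margin: float = 0.0,  # assumed keep bar; the real threshold is not stated here
    seed: int = 0,
) -> Tuple[Candidate, float, List[Dict]]:
    """Propose a candidate, bench it, keep it only if it beats the champion."""
    rng = random.Random(seed)
    best, best_mean = champion, bench(champion[1], pool, games, rng)
    journal: List[Dict] = []
    for entry in range(1, iterations + 1):
        candidate = propose(entry, best, rng)
        mean = bench(candidate[1], pool, games, rng)
        kept = mean > best_mean + keep_margin
        # Every experiment is journaled, kept or not.
        journal.append({"entry": entry, "candidate": candidate[0], "mean": mean, "kept": kept})
        if kept:
            best, best_mean = candidate, mean
    return best, best_mean, journal


def toy_propose(entry: int, best: Candidate, rng: random.Random) -> Candidate:
    """Stand-in for the agent: randomly perturb the current champion."""
    return (f"C{entry:03d}", best[1] * rng.uniform(0.8, 1.3))


if __name__ == "__main__":
    pool = [1.0] * 11  # a fixed 11-opponent pool of equal toy strength
    best, mean, journal = discovery_loop(toy_propose, pool, ("seed", 1.0), iterations=10)
    kept = sum(e["kept"] for e in journal)
    print(f"champion={best[0]} mean={mean:.3f} kept={kept}/{len(journal)}")
```

The design point the sketch preserves is that the journal records discarded experiments as well as kept ones, so the entry count measures effort rather than progress.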
-
The silence
May & June 2026 · 0 sessions
-
May–Jun
Ten weeks dormant.
Zero sessions and zero commits in any of the five repositories across May and June. The Shape Reader sat as the unchallenged champion the entire time. This gap is not a gap in the record; it is a record: ten weeks of a champion sitting untouched made it no stronger, and the moment work resumed, two days of adversarial re-examination found four artifacts that beat it. The silence marks the difference between having an answer and having pressure-tested one.
—
-
Clean-slate day: three arms at once
July 1, 2026
-
Jul 1
The old loop saturates — and gets deleted.
A final main-lineage session pushed the journal from entry 092 to 115 and kept nothing; it also proved The Shape Reader's mr≤9 endgame gate was dead code. On the same day, the discovery loop was restarted from its April 20 birth commit in isolated worktrees, and the rerun began.
model: Fable 5 -
Jul 1
Fable treatment arm vs. Opus control arm.
Two clean-room arms ran in parallel. The Opus control arm, capped at ten iterations, reached X107 at 55.2% mean. The Fable treatment arm found the tie-rule win-tap at its third experiment and faucet denial (+9.4pp) at its eleventh, then kept climbing. Both are told in full in The Rerun.
models: Fable 5 (treatment) · Opus (control) -
Jul 1
The RL arc: imitation, then reward hacking, then 75.8%.
In a clean RL sandbox, a behavior-cloned E12 recovered its teacher exactly (55.1%); PPO fine-tuning first discovered it could refuse to close and hoard points forever, and after a stall-as-loss fix reached a 75.79% pool mean (stochastic policy, 22,000 games). Determinism and league self-play would follow the next day. This is the thread that ends in The Closer.
model: Fable 5 -
Jul 1
AlphaZero, second attempt — the replay-buffer lesson.
A tabula-rasa value net with batched-leaf expectimax reached 65.6% over the 11-bot pool, but did not beat the 67.8% hand-rolled search ceiling. Its most consequential finding was procedural: a never-evicting replay buffer had anchored run 1 to near-random self-play, and emptying it was worth +9pp. Four months after the February 0% result, AlphaZero finally worked.
model: Fable 5
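The replay-buffer lesson from the AlphaZero entry above can be illustrated with a small sketch. The log says the fix was emptying the buffer; the bounded first-in-first-out buffer shown here is a related, swapped-in technique rather than the project's recorded change, and all names and sizes are hypothetical.

```python
import random
from collections import deque
from typing import List, Optional, Set, Tuple

# (generation, index): the generation tags which self-play round made the sample.
Sample = Tuple[int, int]


class ReplayBuffer:
    """Replay buffer that is unbounded when capacity is None, FIFO-bounded otherwise."""

    def __init__(self, capacity: Optional[int] = None) -> None:
        # deque(maxlen=None) never evicts; with a maxlen, the oldest items drop off.
        self._samples: deque = deque(maxlen=capacity)

    def add(self, sample: Sample) -> None:
        self._samples.append(sample)

    def clear(self) -> None:
        """Discard everything, the blunt fix the log describes."""
        self._samples.clear()

    def sample(self, k: int, rng: random.Random) -> List[Sample]:
        pool = list(self._samples)
        return rng.sample(pool, min(k, len(pool)))

    def generations(self) -> Set[int]:
        return {generation for generation, _ in self._samples}

    def __len__(self) -> int:
        return len(self._samples)


def fill(buffer: ReplayBuffer, generations: int, per_generation: int) -> None:
    """Simulate several self-play rounds writing into the same buffer."""
    for generation in range(generations):
        for index in range(per_generation):
            buffer.add((generation, index))


if __name__ == "__main__":
    never_evicting = ReplayBuffer()
    bounded = ReplayBuffer(capacity=100)
    fill(never_evicting, generations=3, per_generation=100)
    fill(bounded, generations=3, per_generation=100)
    print("never-evicting keeps generations:", sorted(never_evicting.generations()))
    print("bounded keeps generations:", sorted(bounded.generations()))
```

In the never-evicting case the earliest, weakest self-play games stay in the training mix indefinitely, which matches the failure the log describes; either clearing the buffer or bounding it removes them.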
-
Ground truth, the ladder, and the freeze
July 2, 2026
-
Jul 2
The endgame is solved exactly.
An exact tablebase solved the cricket endgame slice, 4.16 billion states, and audited every champion against perfect play. The RL policy had the best endgame of anything built; The Shape Reader ranked 10th of 13. Validation even surfaced a real RNG seeding bug, fixed with splitmix64 hardening and ported across repos. The full solve is Ground Truth.
model: Fable 5 -
Jul 2
RL sessions 2–3, the exploitability audit, the underdog specialist.
The RL arc reached its league champion — The Closer (it3), 70.41% league mean, 0 stalls — and measured that reading the opponent, though real, was worth less than the keep bar. A best-response exploitability audit and a separate underdog program (The Grinder) ran the same day.
model: Fable 5 -
Jul 2
One ladder — and the freeze.
All three lineages were finally rated on one Elo scale: The Closer on top at 1222, the clean-slate Shutout fourth at 1176, The Shape Reader ninth at 1118, with zero intransitive cycles clearing the noise bar. With the measurements settled, the research was frozen as of July 2, 2026 — the stamp every champion claim on this site carries — and the site redo you are reading began.
model: Fable 5
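To give the ladder numbers above some intuition, the standard Elo expected-score formula converts a rating gap into a head-to-head win expectation. This is a sketch under an assumption: it uses the conventional 400-point logistic scale, which the log does not confirm the project used, and the function name is hypothetical. Only the three ratings come from the entry above.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Standard Elo expectation for A against B (assumes the usual 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings as reported in the log; nothing else about the ladder is assumed.
LADDER = {
    "The Closer": 1222,
    "The Shutout": 1176,
    "The Shape Reader": 1118,
}

if __name__ == "__main__":
    names = list(LADDER)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            e = expected_score(LADDER[a], LADDER[b])
            print(f"{a} vs {b}: expected score {e:.3f}")
```

On that assumed scale, the 104-point gap between The Closer and The Shape Reader corresponds to an expected score of roughly 0.65 for the higher-rated policy.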
What the log itself teaches
Strip the strategies away and three procedural facts remain. First, the compression: the great majority of a five-month project's durable results fit inside two calendar days once the right harness and the will to burn the history were both present. Second, the silence: ten dormant weeks did not advance the work by a single win-probability point; progress came from adversarial re-examination, not from a champion aging in place. Third, the model honesty: the record names Opus 4.7 for the April autoresearch era and Fable 5 for the July burst, with an Opus control arm run deliberately alongside, and for February and March it names no model at all, because none was recorded. A build log that invented one would be exactly the kind of unscoped claim this project spent five months learning to avoid.
That is where the story ends and the advice begins. Everything the project learned about playing cricket well — distilled from the champion's doctrine — is written down next, as a rule card a person can actually follow at the board.