The Build Log

Five months, 36 working sessions, five repositories, and more than one AI model. This is how the project actually unfolded — what was tried when, who (or what) did the trying, and the ten-week silence that turned out to be the most honest entry in the log.

Research frozen — results as of July 2, 2026

A project in two bursts, ten weeks apart

Every other chapter argues about strategies. This one is about the work itself. The question here is procedural: over five calendar months, in what order did the ideas arrive, and by whose hand? The record is unusually complete, because the work left two parallel trails: a git history across five repositories, and 36 harvested session summaries spanning February 5 to July 2, 2026. Read together, they tell a story the win-rate tables cannot: the project ran in two intense bursts separated by a ten-week dormancy, and the second burst, two days in July, produced almost everything that survived to the freeze.

The shape of five months

February opened the project with 23 sessions of hand-built Q-learning and A2C experiments. April turned on the autoresearch loop and produced The Shape Reader (X188) in two days. Then nothing for ten weeks, with zero sessions and zero commits across all five repositories in May and June, until the entire modern arc (the rerun, the RL policies, the exact tablebase, the unified ladder) was compressed into two days: July 1 and July 2, 2026.

Sessions per month, February–July 2026. The May–June gap — zero sessions, ten weeks — is the empty space between the two bursts.

One caveat before the log. The raw session transcripts from the February–April era were later lost to a routine 30-day cleanup; what survives from that period are the harvested summaries: a dated paragraph per session, distilled at the time. So the early entries below are reconstructed from those summaries plus the git history, and the July entries from both plus the still-live research journals. Where a summary or a commit names the model that did the work, this page cites it; where it does not, the entry says "model: not recorded" rather than guessing. That distinction is the honest core of the "which models" question, and it is worth stating plainly: the era of per-commit model attribution had not yet begun in February.

Which model did which era

Before the log itself, the model dimension deserves its own table, because it is the part most likely to be over-claimed. The clearest, best-documented facts are at the two ends: the project was born on February 5 in a session that deliberately enabled Claude Code's experimental Agent Teams feature, and the entire July burst was Fable-directed, with an Opus control arm run in parallel as a deliberate head-to-head. In between, the record thins. The April autoresearch and site-audit commits carry a Claude Opus 4.7 (1M context) co-author trailer; the February and March commits predate that convention entirely.

Model attribution by era. "Attribution basis" states where the model claim comes from — a commit trailer, a session summary, or an explicit chapter — and marks the eras where no model is recorded.
| Era | Dates | Model(s) | Attribution basis |
|---|---|---|---|
| Greenfield & the RL saga | Feb 5–19 | not recorded | Predates per-commit model trailers; Agent Teams enabled Feb 5 |
| MCTS pivot & tournaments | Mar 9–20 | not recorded | No model trailer on March commits |
| Autoresearch loop & X188 | Apr 17–21 | Opus 4.7 (1M context) | Co-author trailer on the Apr 19–21 commits |
| Dormancy | May–Jun | | No sessions, no commits in any repository |
| Clean-slate day: treatment arm | Jul 1–2 | Fable 5 | Co-author trailer on the fable-cleanslate & main-repo commits |
| Clean-slate day: control arm | Jul 1 | Opus | Explicit control arm (see The Rerun); science commits carry no trailer |
| RL, AlphaZero, tablebase, ladder | Jul 1–2 | Fable 5 | Co-author trailer on the RL and main-repo commits |
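
The attribution rule behind this table (cite a model only when a commit trailer names one, otherwise say "not recorded") can be sketched in a few lines. This is a minimal illustration, not the project's actual tooling: the `recorded_model` helper, the sample commit subject "Add autoresearch scaffold", and the `noreply@example.com` address are all hypothetical, and the sketch assumes trailers follow the common `Co-authored-by: Name <email>` form.

```python
NOT_RECORDED = "not recorded"


def recorded_model(commit_message: str) -> str:
    """Return the name on the last Co-authored-by trailer, or 'not recorded'.

    Assumption: the trailer has the form 'Co-authored-by: Name <email>'.
    Nothing is inferred when no trailer is present.
    """
    for line in reversed(commit_message.strip().splitlines()):
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "co-authored-by":
            # Drop the '<email>' part and keep only the display name.
            name = value.split("<", 1)[0].strip()
            if name:
                return name
    return NOT_RECORDED


# Hypothetical April-style commit: the trailer names the model.
april_commit = (
    "Add autoresearch scaffold\n"
    "\n"
    "Co-authored-by: Claude Opus 4.7 (1M context) <noreply@example.com>"
)

# March-style commit: subject only, no trailer, so no model is claimed.
march_commit = "Initial release: Darts Cricket Strategy Research"

print(recorded_model(april_commit))
print(recorded_model(march_commit))
```

The design choice worth noting is the fallback: a missing trailer yields the literal "not recorded" instead of a guess based on the commit date.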

The one genuinely designed model comparison is the July rerun. On July 1 the discovery loop was restarted from scratch in two isolated arms: a treatment arm driven by Fable 5, allowed to run long, and an Opus control arm capped at ten iterations. The control arm reached a 55.2% champion (X107); at the same iteration count the Fable arm stood at a comparable ~55.0% (X109), before running 100-plus more iterations to find faucet denial and, finally, The Shutout (X181). Because the two arms differ in both model and budget, this is not a clean model benchmark; it is two independent replays of the same search, and the full comparison lives in The Rerun. What the timeline adds is simply that both ran on the same day, side by side.

The log

The chronology below groups the 36 sessions and five repositories' commits into six eras. Each entry states what was tried, what came of it, and the recorded model. Dates are from the harvested session summaries and git history; where the two disagreed on a founding date, the note resolves it.

  1. Greenfield & the reinforcement-learning saga

    February 5–19, 2026 · 23 sessions

    1. Feb 5

      Project born; the game built by a team of agents.

      From a PRD and a fresh CLAUDE.md, a team of three parallel agents built the complete cricket engine (12 tasks, 34 tests) in one day, then added a skill-level constraint system and the plan to validate the AI against Frongello's published strategies. The session's own lesson: agents self-organize to unblock a dependency graph when tasks are formally tracked. This is the project's birthday.

      model: not recorded (Agent Teams enabled)
    2. Feb 7–8

      Two problems surface: a 0% second-mover and a broken benchmark.

      The A2C agent won 0% of games when moving second; the fix expanded the state from 15 to 19 dimensions and replaced frozen self-play with rotation against a bot pool. In parallel, the team found their simulator had only 4 of Frongello's 17 strategies and that perfect-accuracy testing was masking every real difference — so they committed to rebuilding all 17 from the paper.

      model: not recorded
    3. Feb 8–9

      The scoring-symmetry bug, and a skill model grounded in real data.

      Uniform ~50% win rates across matchups turned out to be a game mechanic, not a strategy bug: when both players close in identical order, no one ever scores. Alongside the fix, a per-player SkillProfile replaced the global miss flag with professional accuracy data (Haugh & Wang 2024) — and the pro-profile A2C agent promptly collapsed to a 15.6% win rate under the noisier gradient signal.

      model: not recorded
    4. Feb 9–11

      The site ships; the MPR bug; the bull-difficulty finding.

      A 22×22 tournament runner and the first research site went up on Cloudflare Pages at darts.mattalldian.com, and an interactive advisor ported 22 Python decision trees to client-side JS. A capped MPR calculation was found and fixed (realistic range 0.76–5.29), and a bull-difficulty multiplier (0.75, professionals hit the bull ~25% less) reshuffled the field — E1 (EarlyBull) collapsed, S9 emerged as the real winner.

      model: not recorded
    5. Feb 18–19

      AlphaZero, first attempt — and a flat failure.

      A long tournament was restarted on the Windows homelab (20 cores, ~22-hour runtime), and an AlphaZero player with stochastic MCTS was wired into the validation pipeline. Tested against 28 bots it won 0% across the board — conclusively under-trained — and the session pivoted toward MuZero as the next idea. The learned-search thread would not pay off until July.

      model: not recorded
  2. Consolidation: the MCTS pivot

    March 9–20, 2026 · 4 sessions

    1. Mar 19

      First public release of the research repo.

      The main repository's first commit, "Initial release: Darts Cricket Strategy Research", landed on March 19, six weeks after the project itself began. The birthday and the first public commit are different dates, and the log keeps them distinct.

      model: not recorded
    2. Mar 9–11

      Why A2C struggled — and a 37× jump from MCTS.

      A diagnostic session concluded that A2C was poisoned by weak bots and too slow to iterate, and that MCTS with a hand-crafted evaluation function fit the fast-experiment loop far better. An autonomous MCTS improvement loop then ran 46 experiments and lifted the win rate from 1.3% to 48% — a 37× gain — and established that hand-tuned heuristic leaves beat generic neural evaluators, and that the ceiling was strategic, not accuracy-limited.

      model: not recorded
    3. Mar 12–20

      A 12-billion-game tournament, and an advisor audit.

      A large stratified tournament (~12B games) fed the interactive advisor, and a subagent audit caught the subtle failure modes of aggregated data — spread-vs-shift confusion, survivorship bias, and MPR-to-multiplier mapping errors — before publication. The month closed with the analysis infrastructure solid but the strategy frontier still hand-driven.

      model: not recorded
  3. The autoresearch harness is born

    April 17–21, 2026 · 4 sessions

    1. Apr 17–18

      A full design-audit pass on the site.

      Before the science accelerated, the site got a disciplined, batch-by-batch quality pass — normalize, typeset, harden, animate, arrange, polish, clarify — migrating the palette to design tokens, adopting a fluid type scale, and closing accessibility findings (contrast, focus rings, 44px touch targets, ARIA). Commit-per-batch, verified at each step.

      model: Opus 4.7 (1M context)
    2. Apr 20

      The autoresearch loop turns on.

      The closed-loop strategy-discovery harness was built: run_bench.py generates candidate strategies, benches them against a fixed 11-bot pool, and feeds the results back to an LLM agent with a persistent journal. The loop was born April 20 — its first journal entries and scaffold commit are dated the 20th. A C engine, validated within 1.80pp of the Python reference, made the fast benching possible. Overnight it reached a 62.1% mean (X165) against the original pool, up from E12's 54.6%.

      model: Opus 4.7 (1M context)
    3. Apr 21

      The Shape Reader kept — 89 journal entries in two days.

      Morning re-benching on an expanded pool exposed hidden regressions (X165 dropped to 59.2% and lost to the earlier X109), and a minimalist redesign produced the champion: The Shape Reader (X188), which is X109 plus an S1 detector plus a narrow finish mechanic. It was kept on April 21 as journal entry 89, the 89th entry in two days. It would stand unchallenged for ten weeks.

      model: Opus 4.7 (1M context)
  4. The silence

    May & June 2026 · 0 sessions

    1. May–Jun

      Ten weeks dormant.

      Zero sessions and zero commits in any of the five repositories across May and June. The Shape Reader sat as the unchallenged champion the entire time. This gap is not a gap in the record; it is a record: ten weeks of a champion sitting untouched made it no stronger, and the moment work resumed, two days of adversarial re-examination found four artifacts that beat it. The silence marks the difference between having an answer and having pressure-tested one.

  5. Clean-slate day: three arms at once

    July 1, 2026

    1. Jul 1

      The old loop saturates — and gets deleted.

      A final main-lineage session pushed the journal from entry 092 to 115 and kept nothing; it also proved The Shape Reader's mr≤9 endgame gate was dead code. On the same day, the discovery loop was restarted from its April 20 birth commit in isolated worktrees: the rerun begins.

      model: Fable 5
    2. Jul 1

      Fable treatment arm vs. Opus control arm.

      Two clean-room arms ran in parallel. The Opus control arm, capped at ten iterations, reached X107 at 55.2% mean. The Fable treatment arm found the tie-rule win-tap at its third experiment and faucet denial (+9.4pp) at its eleventh, then kept climbing. Both are told in full in The Rerun.

      models: Fable 5 (treatment) · Opus (control)
    3. Jul 1

      The RL arc: imitation, then reward hacking, then 75.8%.

      In a clean RL sandbox, a behavior-cloned E12 recovered its teacher exactly (55.1%); PPO fine-tuning first discovered it could refuse to close and hoard points forever, and after a stall-as-loss fix reached a 75.79% pool mean (stochastic policy, 22,000 games). Determinism and league self-play would follow the next day. This is the thread that ends in The Closer.

      model: Fable 5
    4. Jul 1

      AlphaZero, second attempt — the replay-buffer lesson.

      A tabula-rasa value net with batched-leaf expectimax reached 65.6% over the 11-bot pool, but did not beat the 67.8% hand-rolled search ceiling. Its most consequential finding was procedural: a never-evicting replay buffer had anchored run 1 to near-random self-play, and emptying it was worth +9pp. Four months after the February 0% result, AlphaZero finally worked.

      model: Fable 5
  6. Ground truth, the ladder, and the freeze

    July 2, 2026

    1. Jul 2

      The endgame is solved exactly.

      An exact tablebase solved the cricket endgame slice (4.16 billion states) and audited every champion against perfect play. The RL policy had the best endgame of anything built; The Shape Reader ranked 10th of 13. Validation even surfaced a real RNG seeding bug, fixed with splitmix64 hardening and ported across repos. The full solve is Ground Truth.

      model: Fable 5
    2. Jul 2

      RL sessions 2–3, the exploitability audit, the underdog specialist.

      The RL arc reached its league champion — The Closer (it3), 70.41% league mean, 0 stalls — and measured that reading the opponent, though real, was worth less than the keep bar. A best-response exploitability audit and a separate underdog program (The Grinder) ran the same day.

      model: Fable 5
    3. Jul 2

      One ladder — and the freeze.

      All three lineages were finally rated on one Elo scale: The Closer on top at 1222, the clean-slate Shutout fourth at 1176, The Shape Reader ninth at 1118, with zero intransitive cycles clearing the noise bar. With the measurements settled, the research was frozen as of July 2, 2026 — the stamp every champion claim on this site carries — and the site redo you are reading began.

      model: Fable 5
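
The ladder ratings in the final entry can also be read as head-to-head expectations. The sketch below assumes the ladder follows the conventional logistic Elo curve with a 400-point scale; the log does not state the scale or fitting method, so both the formula choice and the `elo_expected_score` helper are illustrative rather than the project's own code.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard logistic Elo model.

    The 400-point scale is the conventional default and an assumption here;
    the project's ladder may have been fitted with a different scale.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings quoted in the log's final entry.
ladder = {"The Closer": 1222, "The Shutout": 1176, "The Shape Reader": 1118}

closer_vs_shape = elo_expected_score(ladder["The Closer"], ladder["The Shape Reader"])
print(f"The Closer vs The Shape Reader: {closer_vs_shape:.3f}")
```

Under that assumption, the 104-point gap between The Closer and The Shape Reader corresponds to an expected score of roughly 0.65 per game for The Closer.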

What the log itself teaches

Strip the strategies away and three procedural facts remain. First, the compression: the great majority of a five-month project's durable results fit inside two calendar days once the right harness and the will to burn the history were both present. Second, the silence: ten dormant weeks did not advance the work by a single win-probability point; progress came from adversarial re-examination, not from a champion aging in place. Third, the model honesty: the record names Opus 4.7 for the April autoresearch era and Fable 5 for the July burst, with an Opus control arm run deliberately alongside, and for February and March it names no model at all, because none was recorded. A build log that invented one would be exactly the kind of unscoped claim this project spent five months learning to avoid.

That is where the story ends and the advice begins. Everything the project learned about playing cricket well — distilled from the champion's doctrine — is written down next, as a rule card a person can actually follow at the board.