The AI Agent

Trained from scratch, the agent settled on a turn-order-dependent opening that no catalog strategy uses

This page documents a separate line of work from the Shape Reader. The Shape Reader is a hand-designed strategy that wins the pro-skill bench; this page is about a reinforcement-learning agent trained from scratch to see what a learner would discover without being handed the catalog. The two are measured differently (the agent against 11 curated hard bots, the Shape Reader against a 14-bot pool) and are not directly comparable; the agent’s point is qualitative, not a rankings claim.

The Question

The tournament analysis identified 30 hand-crafted strategies and tested them exhaustively. Every one of those strategies was designed by a human, with assumptions about how the game should be played. What would a learner discover if we removed the human from the loop?

We trained a reinforcement learning agent to play darts cricket from scratch, with no strategy rules, no opening book, and no notion of “scoring mode” or “covering mode.” The agent sees only the board state and learns from wins and losses. After millions of games against the strongest hand-crafted bots, it discovered something none of them do: it adapts its opening based on turn order.

What the Agent Learned

Turn-Order-Dependent Opening

Going first → always open 20 (100% consistency).
Going second → always open 18 (93% consistency).

No Frongello strategy, and none of the experimental strategies (E1–E12), adapts to turn order. All 28 hand-crafted bots play identically regardless of who throws first. The agent discovered this distinction entirely on its own.

A plausible reading of the logic: going first, you hold the tempo advantage, so you secure the highest-value target (20) immediately. Going second, your opponent has likely started on 20, so you pivot to 18 and avoid direct competition on the same number. After the opening, the agent follows a consistent closing order: 19 next, then the remaining targets.

Adaptive Mid-Game Behavior

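Beyond the opening, the agent balances scoring and closing according to game state. Early on it closes about 76% of the time; once its foundations are built, it shifts to scoring 69–72% of the time.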

The agent also shows opponent-specific adaptation. Against S2 (the strongest Frongello strategy), it flips to 57% closing when behind — respecting S2’s sophisticated covering logic by racing to close rather than trying to outscore it.

Chasing: Almost Never

When the opponent threatens a target (2 marks toward closing), the agent chases only 31% of the time. It mostly ignores the opponent’s progress and follows its own plan. That behavior, reached through pure trial and error, is broadly consistent with Frongello’s “never chase” finding.

Three approaches

Getting to a working agent required three different architectures and 19 iterations. Each failure pointed at a specific constraint: state-space size, action-space structure, and opponent curriculum, in that order.

Attempt 1: Tabular Q-Learning

The simplest approach: maintain a lookup table mapping every game state to the value of each action. After each game, update the table using the Bellman equation. This works beautifully for small games, but darts cricket’s state space — two scores plus 14 mark counts — creates millions of unique states. The table grows unbounded, most entries are visited only once, and the agent plateaus around 45% win rate against basic bots.
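The update described above can be sketched as a one-step Q-learning rule. This is a minimal illustration, not the project's code: the state tuple layout, the learning rate `alpha`, and the discount `gamma` are assumptions chosen for the example.

```python
from collections import defaultdict

N_ACTIONS = 21  # 7 targets x 3 hit types


def make_q_table():
    # Unseen states default to all-zero action values.
    return defaultdict(lambda: [0.0] * N_ACTIONS)


def q_update(q, state, action, reward, next_state, done, alpha=0.1, gamma=0.99):
    """Apply one Bellman (Q-learning) update for a single transition."""
    best_next = 0.0 if done else max(q[next_state])
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


# Hypothetical state encoding: (my_score, opp_score, 14 mark counts).
q = make_q_table()
s0 = (0, 0) + (0,) * 14
s1 = (0, 0) + (3,) + (0,) * 13
q_update(q, s0, action=2, reward=0.0, next_state=s1, done=False)
q_update(q, s1, action=5, reward=1.0, next_state=s1, done=True)
```

Every distinct tuple becomes its own dictionary key, which is why the table keeps growing: two states that differ by a single mark share nothing.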

Attempt 2: Deep Q-Network (DQN)

Replace the table with a neural network that generalizes across similar states: a standard fully-connected network (128×128) with experience replay and a frozen target network. This should scale better, but it struggled for a subtle reason.

DQN treats the 21 possible actions (7 targets × 3 hit types) as independent, unrelated choices. The network has no structural understanding that “aim at 20” and “aim at 19” are similar decisions, or that “single” and “triple” differ only in magnitude. Every relationship must be learned from scratch, and gradient updates for one action interfere with others. DQN never broke past 45% win rate.
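To make the flat-action problem concrete, here is a small sketch of how 21 actions might be indexed. The ordering of `TARGETS` and `HIT_TYPES` is an assumption for illustration; the source does not specify the encoding.

```python
TARGETS = [20, 19, 18, 17, 16, 15, 25]  # 25 stands for the bull (assumed ordering)
HIT_TYPES = ["single", "double", "triple"]


def decode_flat(action_index):
    """Map a flat action index (0..20) back to a (target, hit type) pair."""
    target_idx, hit_idx = divmod(action_index, len(HIT_TYPES))
    return TARGETS[target_idx], HIT_TYPES[hit_idx]


def encode_flat(target, hit_type):
    """Map a (target, hit type) pair to its flat action index."""
    return TARGETS.index(target) * len(HIT_TYPES) + HIT_TYPES.index(hit_type)


FLAT_OUTPUTS = len(TARGETS) * len(HIT_TYPES)      # one output per combination
BRANCHED_OUTPUTS = len(TARGETS) + len(HIT_TYPES)  # one output per factor
```

Under this encoding, indices 2 and 3 are "20 triple" and "19 single". To a flat output layer they are simply neighboring units, with nothing in the structure saying that they share neither target nor hit type.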

Attempt 3: Branching Actor-Critic (A2C)

The architecture that worked. Instead of one flat output, the network splits into three specialized heads:

Input: 19 features (score differential, 14 marks, 4 aggregate counts)
                          ↓
         Shared Trunk: 128 → ReLU → 128 → ReLU
                          ↓
      ┌───────────────────┼───────────────────┐
      │                   │                   │
 Target Head         HitType Head         Value Head
64 → ReLU → 7       64 → ReLU → 3       64 → ReLU → 1
(which number?)    (single/dbl/tri?)  (how good is this state?)
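One way to read the two policy heads is as a factored action distribution: a target is drawn from the 7-way head, a hit type from the 3-way head, and the log-probability of the combined action is the sum of the two. The sketch below assumes the heads are sampled independently, which the source does not state; the function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_branching_action(target_logits, hit_logits, rng):
    """Sample (target, hit type) from two heads and return the joint log-prob."""
    target_probs = softmax(target_logits)
    hit_probs = softmax(hit_logits)
    target = rng.choices(range(len(target_probs)), weights=target_probs)[0]
    hit = rng.choices(range(len(hit_probs)), weights=hit_probs)[0]
    # Independence assumption: log p(target, hit) = log p(target) + log p(hit).
    log_prob = math.log(target_probs[target]) + math.log(hit_probs[hit])
    return target, hit, log_prob


rng = random.Random(0)
target, hit, log_prob = sample_branching_action([0.0] * 7, [0.0] * 3, rng)
```

With all-zero logits both heads are uniform, so every joint action has probability 1/21, the same as a uniform flat policy. The difference is that a gradient step on the target head affects all three hit types for that target at once.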

Each head learns its own domain. The target head figures out which number to aim at based on board state; the hit-type head learns how aggressively to aim based on skill level. The value head provides the temporal-difference signal that drives learning. Two critical mechanisms keep the agent honest:

Training: 12 bugs and 3 course corrections

The architecture was only the first step. Making it actually learn took 19 versions and the resolution of 12 distinct training bugs. Three specific interventions shaped the final result.

1. Supervised pre-training

A randomly initialized network explores blindly — throwing darts at every number equally, learning nothing for thousands of games. The solution: bootstrap the network by watching expert strategy bots play. We generated training data from four Frongello strategies (S1, S2, S6, S10), recording every dart decision, and trained the network via supervised learning with label smoothing (0.1) to prevent the output probabilities from collapsing to near-certainty.
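Label smoothing with a factor of 0.1 can be sketched as follows. This uses one common convention (spread the smoothing mass uniformly over all classes, including the expert's choice); whether the project used exactly this variant is an assumption.

```python
import math


def smooth_labels(expert_action, n_classes, smoothing=0.1):
    """Return a soft target: most mass on the expert action, the rest spread out."""
    off_value = smoothing / n_classes
    target = [off_value] * n_classes
    target[expert_action] += 1.0 - smoothing
    return target


def cross_entropy(target, probs):
    """Cross-entropy between a soft target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)


# Expert aimed at target index 0 out of 7 possible targets.
soft_target = smooth_labels(0, 7)
peaked_prediction = [0.9999] + [0.0001 / 6] * 6
loss_peaked = cross_entropy(soft_target, peaked_prediction)
loss_matched = cross_entropy(soft_target, soft_target)
```

Because the soft target never reaches 1.0, a prediction that collapses to near-certainty is penalized (`loss_peaked` is larger than `loss_matched`), which leaves the policy some room to explore afterwards.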

After pre-training, the network opens with 20-triple ~91% of the time — sensible but not locked in. It has a starting intuition but remains open to discovering something better.

2. Prune the easy opponents

This turned out to be the single most important change. Version 16 trained against all 28 bots (S1–S17 plus E1–E12). The agent won 48.5% of its games overall, but against the 13 hardest bots it won only 43%, worse than the simple sequential closer (S1 at 49.1%).

The problem: easy opponents like E6 (99.7% win rate) and E7 (86.7%) produced a “keep doing what you’re doing” gradient signal. Hard opponents produced a “you need to improve” signal. These contradicted each other, and the easy wins drowned out the hard losses.

The fix was simple: remove every bot the agent could beat more than 45% of the time. With only 13 hard bots remaining, every game produced useful learning signal. Win rate climbed from 5% to 55% over 2 million games, and the novel 20/18 opening split emerged.
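The pruning rule amounts to a filter over per-opponent win rates. In this sketch the E6 and E7 figures come from the text above; `hard_bot_a` and `hard_bot_b` and their win rates are hypothetical placeholders, and the function name is an assumption.

```python
def prune_pool(win_rates, threshold=0.45):
    """Keep only opponents the agent beats at most `threshold` of the time."""
    return {bot: rate for bot, rate in win_rates.items() if rate <= threshold}


# Agent win rate against each opponent (E6/E7 from the text, others made up).
measured = {
    "E6": 0.997,
    "E7": 0.867,
    "hard_bot_a": 0.41,  # hypothetical
    "hard_bot_b": 0.44,  # hypothetical
}
training_pool = prune_pool(measured)
```

The threshold is re-evaluated against current win rates, so a bot that survives one round of pruning can still be dropped later once the agent starts beating it comfortably.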

3. The entropy floor

Version 17 peaked at 55% win rate after ~4.4 million games — then started declining. The agent was overtraining: its entropy coefficient had annealed to 0.005, making the policy nearly deterministic. It could no longer adapt to slight variations in opponent behavior.

Raising the entropy floor from 0.005 to 0.01 solved the problem completely. Version 19, trained with the higher floor, held stable at 49.3% win rate with zero decline over 2 million additional games. The lower absolute number reflects a tighter bot pool (the two easiest remaining bots were also removed), not a weaker agent.
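An entropy floor is simply a lower clamp on the annealed coefficient. The sketch below assumes a linear schedule and a starting value of 0.05; both are illustrative, since the source gives only the floor values (0.005 and 0.01).

```python
def entropy_coef(step, total_steps, initial=0.05, floor=0.01):
    """Linearly anneal the entropy coefficient, never dropping below `floor`."""
    fraction_done = min(max(step / total_steps, 0.0), 1.0)
    annealed = initial * (1.0 - fraction_done)
    return max(floor, annealed)


start = entropy_coef(0, 1_000_000)
late_high_floor = entropy_coef(1_000_000, 1_000_000, floor=0.01)
late_low_floor = entropy_coef(1_000_000, 1_000_000, floor=0.005)
```

Late in training the schedule itself contributes nothing, so the floor alone determines how much randomness the policy keeps. Doubling it from 0.005 to 0.01 doubles the weight of the entropy bonus at exactly the point where the policy would otherwise harden.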

The 12 Bugs

Beyond the three breakthroughs, the agent’s development required solving 12 distinct technical problems. Each one caused either training collapse, reward hacking, or silently invalid results.

1. Bull Reward Bias

Agent throws bull 100% of the time. Closing bonus × face value (25) makes bull the highest-reward target. Fix: terminal-only rewards.

2. Peaked Pre-training

After supervised pre-training, softmax outputs collapse to 99.99% on one action. Entropy bonus can’t recover. Fix: label smoothing (0.1), fewer epochs.

3. Infinite Games

Degenerate policy never closes all 7 targets. Trajectory buffer grows to 4.4GB. Fix: 100-turn maximum per game.

4. Masking Mismatch

Bull-triple masking applied during action selection but not during gradient computation. Distribution mismatch corrupts training. Fix: consistent masking in both paths.
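A simple way to keep the two paths consistent is to route both through one masked-probability function. This is a sketch under that assumption; the function names are hypothetical and the mask here is a generic list of booleans rather than the project's actual bull-triple rule.

```python
import math


def masked_probs(logits, valid_mask):
    """Softmax over valid actions only; masked actions get probability 0."""
    if not any(valid_mask):
        raise ValueError("at least one action must be valid")
    masked = [x if ok else -math.inf for x, ok in zip(logits, valid_mask)]
    m = max(masked)
    exps = [math.exp(x - m) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def masked_log_prob(logits, valid_mask, action):
    """Log-probability used in the loss; shares the mask with action selection."""
    return math.log(masked_probs(logits, valid_mask)[action])


probs = masked_probs([1.0, 1.0, 1.0], [True, False, True])
logp = masked_log_prob([1.0, 1.0, 1.0], [True, False, True], action=0)
```

If the loss instead used an unmasked softmax, the same action would be scored as log(1/3) rather than log(1/2), and the gradient would be computed for a distribution the agent never actually sampled from.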

5. Self-Play Collapse

Both players updating simultaneously chase each other into degenerate strategies. Fix: freeze Player 2, train only Player 1.

6. Reward Hacking

Intermediate rewards for closing targets cause the agent to loop on 20-triple forever. Fix: remove all intermediate rewards entirely.

7. Entropy Annealing Bug

Constructor passes custom initial entropy, but annealing function hardcodes the default. Fix: store per-instance initial value.

8. Dead-Target Blindness

Without masking, the network can’t learn to avoid closed targets from terminal rewards alone — the signal is too diluted across 60 darts. Fix: dead-target action masking.

9. Double-Counted Rewards

Win reward applied during gameplay AND at game end = +200 instead of +100. Fix: compute_reward() always returns 0; terminal rewards applied exclusively at game end.
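The fix can be pictured as two separate functions with a single place where the terminal reward is added. The signature of `compute_reward()` and the symmetric loss value of -100 are assumptions for illustration; the source confirms only that the per-step function returns 0 and that a win is worth +100.

```python
WIN_REWARD = 100.0


def compute_reward(state, action):
    """Per-dart reward: always zero, so nothing is credited during play."""
    return 0.0


def terminal_reward(agent_won):
    """Reward applied exactly once, at game end (loss value is an assumption)."""
    return WIN_REWARD if agent_won else -WIN_REWARD


def episode_rewards(n_steps, agent_won):
    """Build the reward sequence for one game of `n_steps` darts."""
    rewards = [compute_reward(None, None) for _ in range(n_steps)]
    rewards[-1] += terminal_reward(agent_won)
    return rewards


won = episode_rewards(3, agent_won=True)
lost = episode_rewards(3, agent_won=False)
```

Keeping the terminal reward in one function makes the double-counting bug hard to reintroduce: the sum over any winning episode is +100 by construction.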

10. Deterministic Games

With miss_enabled=False, every triple lands. First player always wins by one turn. All v4–v8 results were invalidated. Fix: enable realistic miss probabilities.

11. Easy Bot Pollution

Easy opponents drown out hard-opponent learning signals. Agent converges to S1 behavior (43% vs hard bots). Fix: prune to hard-only bot pool.

12. Overtraining

Entropy floor too low (0.005). Policy becomes deterministic, loses adaptability. Win rate declines 3% past 4.4M games. Fix: entropy floor 0.01.

Version History

v1–v3 · Tabular Q-Learning

Bellman updates on a state-action lookup table. Plateaus at ~45% WR against basic bots. State space too large for tabular methods.

v4–v8 · Deep Q-Network

Double DQN with experience replay. Fails at ~45% WR. Flat action space prevents structured learning. Results later invalidated — deterministic game mechanics (Bug 10) made first-mover always win.

v9–v15 · A2C Development

Branching architecture implemented. Bugs 1–10 discovered and fixed iteratively. Supervised pre-training pipeline built. Agent begins learning meaningful strategies but can’t break past 50%.

v16 · First Full Run (All 28 Bots)

First end-to-end A2C training. 48.5% overall WR, but only 43% against hard bots — worse than the trivial S1 strategy. Easy opponents poison the learning signal.

v17 · Hard bots only

Pruned to 13 hard bots. Win rate climbed from 5% to 55% over 2M games, and the novel 20/18 turn-order-dependent opening emerged. Overtraining began after 4.4M games, traceable to an entropy floor set too low.

v18 · Progressive Pruning

Removed S4 (62% WR) and S8 (66% WR) as the agent began farming them. 11 remaining bots all in the 47–57% range. WR stabilizes at ~50% with tighter competition.

v19 · Entropy floor

Entropy floor raised to 0.01. 49.3% win rate with zero decline over 2M games. The agent holds its adaptability. Final stable model.

Final Results

The v19 agent plays at pro skill level (MPR ~5.3) against the 11 hardest hand-crafted bots. Its 49.3% average win rate means it holds its own against a curated field where every opponent is a strong strategy — not a single easy win in the pool.

Opponent                             Win Rate   Notes
E11 (Kitchen Sink)                   53.2%      Best matchup; complexity works against E11
S6 (Lead + Extra Darts)              52.1%      Extra darts disrupt S6’s closing tempo
S16 (Chase + Extra + High Thresh.)   50.3%
S14 (Chase + Extra + Low Thresh.)    50.2%
E9 (Phase Shift, 3 phases)           49.2%
S10 (Chase + Low Thresh.)            49.0%
E5 (Smart Aim)                       48.3%
E1 (Early Bull)                      48.3%
E3 (Greedy Close-and-Score)          44.9%
E10 (Score Surge)                    44.8%
S2 (Lead Then Cover)                 42.9%      Toughest opponent: S2’s sophisticated covering logic

All games played at pro skill level with realistic miss probabilities. Win rates measured over 2 million games with entropy floor 0.01.

What it means

Independent confirmation of Frongello

The agent rediscovered several Frongello principles without being told them. It learned to score before covering (76% closing early, shifting to 69–72% scoring once foundations are built). It learned not to chase (31% chase rate when threatened). These principles emerged from pure win/loss feedback, which is an independent line of support for the claim that they reflect strategic structure in the game rather than a habit of human strategy design.

A dimension the catalog did not parameterize

The turn-order-dependent opening is not a refinement of an existing strategy. No bot in the 28-strategy catalog adapts its opening to who throws first. The agent found a dimension that the catalog never parameterized and acted on it. That suggests other unparameterized dimensions (opponent modeling, dart-within-turn sequencing, adaptive thresholds) might also carry value, though this page does not test that claim.

Curriculum mattered more than architecture

The single most impactful change across all 19 versions was not the network architecture, the reward function, or the learning rate. It was removing easy opponents from the training pool. When every game produces useful signal the agent learns; when easy wins dilute hard losses it does not. We do not claim this generalizes to other RL systems, but it is consistent with observations from other self-play work.