Building the Player-Coach Loop

How the PlayerAgent, CoachAgent, and constraint schema work together in practice.

May 21, 2026

The player-coach system is an adversarial quality loop for trading decisions. A PlayerAgent proposes trading actions — enter long AMZN at 264.14, size 3% of portfolio — against a formal constraint schema that specifies position limits, leverage caps, risk/reward minimums, and daily loss thresholds. A CoachAgent evaluates every proposed action against every constraint, mechanically, and either approves or rejects with named violations. The Player revises and resubmits. The full exchange — every proposal, every rejection, every named violation, every approval — is recorded as a structured artifact in SQLite.

The preceding essay (“The Adversarial Quality Loop”) described the structure and why it creates quality. This essay shows the implementation.

The Python package is on PyPI. The dashboard requires cloning the repository:

# Python package: agents, loop, artifacts, backtesting
pip install player-coach-core[llm]
export ANTHROPIC_API_KEY=your-key

# Dashboard: clone the repo, then install Streamlit deps
git clone https://github.com/MaverickHQ/crucible-player-coach
cd crucible-player-coach
pip install player-coach-core[dashboard]
streamlit run dashboard/app.py

A live dashboard is running at [STREAMLIT_URL]. The rest of this essay explains how it works.

What the PlayerAgent actually does

The PlayerAgent receives a world state — symbol, price, five-day and ten-day moving averages, volume, current session, volatility regime, and current portfolio position — and returns 1–3 proposed trading actions.

import json
import os
from pathlib import Path

from player_coach.agents.player import PlayerAgent
from player_coach.constraints.schema import ConstraintSchema

# Load the conservative preset as the schema the Player proposes against
constraints = ConstraintSchema.from_dict(
    json.loads(Path("examples/constraints/conservative.json").read_text())
)
player = PlayerAgent(api_key=os.environ["ANTHROPIC_API_KEY"])

# Snapshot of market and portfolio conditions for one decision
world_state = {
    "symbol": "AMZN",
    "price": 264.14,
    "sma5": 267.26,
    "sma10": 270.07,
    "volume": 40_672_200,
    "position": "flat",
    "volatility_regime": "low",
    "session": "NY_open",
}

result = player.decide(world_state, constraints)

The result is a structured dict:

{
  "actions": [
    {
      "action_type": "enter_long",
      "symbol": "AMZN",
      "size_pct": 0.03,
      "entry_price": 264.14,
      "stop_loss": 259.14,
      "take_profit": 274.14
    }
  ],
  "reasoning": "AMZN is trading below both moving averages, indicating downward pressure, but the low volatility regime and NY open session suggest controlled conditions. A modest long position targets a 2.0 risk/reward ratio, staying well within the 0.03 max_single_trade_pct constraint."
}

The PlayerAgent calls claude-haiku-4-5-20251001 with a system prompt that specifies the output format exactly and forbids markdown and code fences. The proposal is structured, not narrative. The reasoning field is the only place where judgment is expressed in prose.

One design decision worth explaining: the PlayerAgent receives the full constraint schema as context. It knows the limits before it proposes. This is not to make the Coach’s job easier — it is to make the Player’s proposals coherent. A proposal that enters a 20% position on an account with a 10% max is not interesting to evaluate. It tells you nothing about how the system behaves at the edges of the constraint envelope.

What the CoachAgent actually does

The CoachAgent receives the proposal, the constraint schema, and the current world state. It evaluates every proposed action against every applicable constraint and returns a verdict with named violations and a critique.

from player_coach.agents.coach import CoachAgent

# Reuses os, result, constraints, and world_state from the PlayerAgent example
coach = CoachAgent(api_key=os.environ["ANTHROPIC_API_KEY"])
evaluation = coach.evaluate(result, constraints, world_state)

The evaluation:

{
  "verdict": "APPROVE",
  "violations": [],
  "critique": "allowed_symbols: AMZN in [AMZN, MSFT, TSLA, BTC-USD] — PASS. max_single_trade_pct: 0.03 <= 0.03 — PASS. max_open_positions: 1 entry action <= 2 — PASS. max_position_pct: 0.03 total <= 0.10 — PASS. min_risk_reward: |274.14 - 264.14| / |264.14 - 259.14| = 10.0 / 5.0 = 2.0 >= 2.0 — PASS."
}

Or, when the proposal fails:

{
  "verdict": "REJECT",
  "violations": ["min_risk_reward"],
  "critique": "allowed_symbols: AMZN in [AMZN, MSFT, TSLA, BTC-USD] — PASS. max_single_trade_pct: 0.03 <= 0.03 — PASS. max_open_positions: 1 entry action <= 2 — PASS. max_position_pct: 0.03 total <= 0.10 — PASS. min_risk_reward: |271.14 - 264.14| / |264.14 - 261.64| = 7.0 / 2.5 = 2.8... wait — |take_profit - entry| / |entry - stop_loss| = |269.14 - 264.14| / |264.14 - 261.64| = 5.0 / 2.5 = 2.0 — PASS. Rechecking: entry 264.14, stop 262.14, take 267.14 — |267.14 - 264.14| / |264.14 - 262.14| = 3.0 / 2.0 = 1.5 < 2.0 required — FAIL."
}

The Coach is not expressing a view on the trade. It is running arithmetic. The system prompt instructs it to check each constraint separately, show its working, and return no markdown, no code fences, and no explanation outside the JSON. The critique field is the working. The verdict is the result.
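The risk/reward arithmetic in the critique can also be reproduced deterministically. The sketch below is not the package's implementation; the function names and the tolerance parameter are assumptions made for illustration. It applies the same formula the critique shows, |take_profit - entry| / |entry - stop_loss|, to the approved proposal and to the final numbers in the rejected one.

```python
def risk_reward(entry_price: float, stop_loss: float, take_profit: float) -> float:
    """Return |take_profit - entry| / |entry - stop_loss|."""
    risk = abs(entry_price - stop_loss)
    if risk == 0:
        raise ValueError("stop_loss equals entry_price, so risk is zero")
    return abs(take_profit - entry_price) / risk


def check_min_risk_reward(action: dict, minimum: float, tol: float = 1e-9) -> tuple:
    """Return (passed, critique_line) for one proposed action (hypothetical helper)."""
    ratio = risk_reward(action["entry_price"], action["stop_loss"], action["take_profit"])
    # Prices such as 264.14 are not exact in binary floating point, so the
    # comparison uses a small tolerance instead of a bare >=.
    passed = ratio >= minimum - tol
    relation = ">=" if passed else "<"
    status = "PASS" if passed else "FAIL"
    return passed, f"min_risk_reward: {ratio:.2f} {relation} {minimum:.2f} - {status}"


# Numbers taken from the two evaluations shown above
approved = {"entry_price": 264.14, "stop_loss": 259.14, "take_profit": 274.14}
rejected = {"entry_price": 264.14, "stop_loss": 262.14, "take_profit": 267.14}

print(check_min_risk_reward(approved, 2.0)[1])
print(check_min_risk_reward(rejected, 2.0)[1])
```

A check like this is useful as a cross-reference when reading a critique: if the Coach's stated ratio and the recomputed ratio disagree, the critique is the thing to inspect.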

Two things in the Coach’s design are worth noting. First, the Coach receives max_tokens=1024. The critique field lists every constraint check with its arithmetic, which is considerably more text than a bare verdict. During development, a 256-token budget truncated the JSON mid-response and caused parse failures. 1024 leaves enough headroom for complete critiques across typical constraint combinations.

Second, the system prompt explicitly forbids markdown. Haiku-class models default to wrapping JSON in code fences when not told otherwise. A single instruction eliminates an entire class of parsing failure.
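Both failure modes, truncated JSON and fenced JSON, surface at the point where the response text is parsed. As a rough illustration, here is one way a defensive parser could look. This is a sketch and not the package's code: parse_model_json and strip_code_fence are hypothetical names, and the essay's own approach is to prevent fences through the prompt rather than strip them afterwards.

```python
import json

FENCE = "`" * 3  # the three-backtick code fence marker


def strip_code_fence(text: str) -> str:
    """Remove one surrounding code fence, if present (hypothetical helper)."""
    text = text.strip()
    lines = text.splitlines()
    if len(lines) >= 3 and lines[0].startswith(FENCE) and lines[-1].strip() == FENCE:
        # Drop the opening fence line (with optional language tag) and the closing fence
        text = "\n".join(lines[1:-1]).strip()
    return text


def parse_model_json(raw: str) -> dict:
    """Parse a model response as a JSON object, failing loudly on bad input."""
    text = strip_code_fence(raw)
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as exc:
        # A response cut off by the token budget typically ends up here
        raise ValueError(f"response is not valid JSON (possibly truncated): {exc}") from exc
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")
    return parsed
```

Raising a single exception type for both cases keeps the caller simple: any ValueError means the round produced no usable verdict.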

How the CoachLoop orchestrates the exchange

The CoachLoop connects the two agents, manages the round limit, checks circuit breakers, writes the artifact, and persists to SQLite.

from player_coach.loop.coach_loop import CoachLoop
from player_coach.artifacts.writer import ArtifactWriter
from player_coach.database.store import DatabaseStore

store = DatabaseStore("data/player_coach.db")
writer = ArtifactWriter(output_dir="artifacts")
loop = CoachLoop(player=player, coach=coach, artifact_writer=writer)

artifact = loop.run(
    world_state=world_state,
    constraints=constraints,
    db_store=store,
    strategy_id="conservative-amzn",
    output_dir=Path("artifacts"),
)

print(artifact["outcome"])       # APPROVE / REJECT-MAX / ABORT
print(artifact["rounds_taken"])  # 1 / 2 / 3

The loop runs for up to constraints.max_rounds rounds (default: 3). Each round:

PlayerAgent proposes (passing prior round history so it knows what was rejected and why)
CoachAgent evaluates
If APPROVE: write artifact, write to SQLite, return
If REJECT: continue to next round with the full round history
If ABORT (hard constraint breach like max_leverage): abort immediately, no further rounds

After max_rounds rejections without approval, the loop terminates with outcome REJECT-MAX. The artifact records all rounds regardless of outcome.
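The control flow described above is small enough to sketch with plain callables standing in for the two agents. This is an illustration under stated assumptions, not the CoachLoop source: run_loop, stub_player, and stub_coach are hypothetical, and persistence and circuit breakers are left out.

```python
from typing import Callable


def run_loop(propose: Callable[[dict, list], dict],
             evaluate: Callable[[dict], dict],
             world_state: dict,
             max_rounds: int = 3) -> dict:
    """Run propose/evaluate rounds until APPROVE, ABORT, or the round limit."""
    rounds = []
    for round_number in range(1, max_rounds + 1):
        # The proposer sees every prior round, including rejections and violations
        proposal = propose(world_state, rounds)
        evaluation = evaluate(proposal)
        rounds.append({"round": round_number, "proposal": proposal, "evaluation": evaluation})
        if evaluation["verdict"] in ("APPROVE", "ABORT"):
            return {"outcome": evaluation["verdict"],
                    "rounds_taken": round_number, "rounds": rounds}
    # Every round was rejected
    return {"outcome": "REJECT-MAX", "rounds_taken": max_rounds, "rounds": rounds}


def stub_player(world_state: dict, history: list) -> dict:
    # Widens the take-profit once a prior round has been rejected
    take_profit = 267.14 if not history else 274.14
    return {"entry_price": world_state["price"], "take_profit": take_profit}


def stub_coach(proposal: dict) -> dict:
    ok = proposal["take_profit"] >= 274.0
    return {"verdict": "APPROVE" if ok else "REJECT",
            "violations": [] if ok else ["min_risk_reward"]}


artifact = run_loop(stub_player, stub_coach, {"price": 264.14})
print(artifact["outcome"], artifact["rounds_taken"])
```

Because the history list is appended to on every round regardless of verdict, the returned structure holds all rounds whether the loop ends in approval, abort, or REJECT-MAX.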

Circuit breakers fire before the first round, not after. Four checks in priority order:

MLL breached — maximum lifetime loss exceeded, account terminated
Daily loss limit — today’s loss too large, skip today
Consistency rule — today’s gain too large relative to cumulative, skip today
Trading cutoff — past 16:20 ET, market hours ended

If any circuit breaker fires, no API calls are made. The artifact records the reason and returns immediately. This matters for backtesting: a strategy that hits the consistency rule at 14:00 should not call Claude 40 more times before 16:20.
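A priority-ordered check of this kind might look like the sketch below. The dataclasses, field names, and reason strings are all hypothetical, and the consistency rule is modelled here as "today's gain as a share of cumulative gain", which is an assumption about how the rule is defined. Only the ordering and the 16:20 ET cutoff come from the description above.

```python
from dataclasses import dataclass
from datetime import time
from typing import Optional


@dataclass
class AccountState:          # hypothetical shape
    lifetime_loss: float
    daily_loss: float
    daily_gain: float
    cumulative_gain: float
    now_et: time


@dataclass
class BreakerLimits:         # hypothetical shape
    max_lifetime_loss: float
    max_daily_loss: float
    max_daily_gain_share: float
    cutoff_et: time = time(16, 20)


def first_tripped_breaker(state: AccountState, limits: BreakerLimits) -> Optional[str]:
    """Return the highest-priority breaker that fires, or None if trading may proceed."""
    if state.lifetime_loss >= limits.max_lifetime_loss:
        return "mll_breached"
    if state.daily_loss >= limits.max_daily_loss:
        return "daily_loss_limit"
    if (state.cumulative_gain > 0
            and state.daily_gain / state.cumulative_gain > limits.max_daily_gain_share):
        return "consistency_rule"
    if state.now_et >= limits.cutoff_et:
        return "trading_cutoff"
    return None


limits = BreakerLimits(max_lifetime_loss=5000.0, max_daily_loss=1000.0,
                       max_daily_gain_share=0.5)
late_and_losing = AccountState(lifetime_loss=1200.0, daily_loss=1100.0, daily_gain=0.0,
                               cumulative_gain=0.0, now_et=time(16, 30))
print(first_tripped_breaker(late_and_losing, limits))
```

Returning on the first match is what gives the checks their priority: a day that is both past the cutoff and over the daily loss limit reports the loss limit, since that check comes first.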

What the artifact contains

Every completed exchange produces a JSON artifact on disk and a row in SQLite. The artifact:

{
  "run_id": "431c761c-4dbb-46da-b315-3af4990140b8",
  "timestamp": "2026-05-17T08:20:29.375095+00:00",
  "outcome": "APPROVE",
  "approved": true,
  "rounds_taken": 1,
  "termination_reason": null,
  "total_tokens": 0,
  "rounds": [
    {
      "round": 1,
      "proposal": {
        "actions": [...],
        "reasoning": "..."
      },
      "evaluation": {
        "decision": "APPROVE",
        "violations": [],
        "feedback": "allowed_symbols: AMZN — PASS. ..."
      }
    }
  ],
  "constraint_snapshot": {
    "max_position_pct": 0.10,
    "max_single_trade_pct": 0.03,
    ...
  },
  "portfolio_snapshot": null,
  "strategy_id": "conservative-amzn",
  "symbol": "AMZN"
}

The constraint_snapshot records the exact schema that was in effect at the time of the exchange. When you review an exchange six months later, you know which constraints governed it — not the current schema, but the one that was active when the decision was made.

The rounds array contains the full history. A three-round exchange with two rejections and a final approval records all three proposals, all three verdicts, all three sets of named violations, and all three critiques. Nothing is lost.
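Because the round history is plain JSON, reviewing an exchange needs nothing beyond the standard library. The helper below is a hypothetical sketch that relies only on the field names visible in the artifact above (rounds, round, evaluation, decision, violations); the sample data is made up for the demonstration.

```python
def summarize_rounds(artifact: dict) -> list:
    """Return one human-readable line per round of an exchange artifact."""
    lines = []
    for entry in artifact["rounds"]:
        evaluation = entry["evaluation"]
        violations = ", ".join(evaluation["violations"]) or "none"
        lines.append(f"round {entry['round']}: {evaluation['decision']} "
                     f"(violations: {violations})")
    return lines


# Made-up two-round exchange using the artifact's field names
sample = {
    "outcome": "APPROVE",
    "rounds": [
        {"round": 1, "evaluation": {"decision": "REJECT",
                                    "violations": ["min_risk_reward"]}},
        {"round": 2, "evaluation": {"decision": "APPROVE", "violations": []}},
    ],
}

for line in summarize_rounds(sample):
    print(line)
```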

The constraint preset system

Five presets ship with the package:

examples/constraints/
    conservative.json        # tight limits, 2 positions max
    moderate.json            # balanced, 3 positions max
    aggressive.json          # wider limits, 5 positions max
    strict.json              # very tight, designed to force rejections
    futures_compatible.json  # daily loss and cutoff tuned for day trading

The presets are starting points, not recommendations. The Constraints page in the dashboard lets you load any preset, adjust every field with sliders, see the live JSON as you edit, and push the configuration directly to the Trade Review page. Export any configuration as JSON and it becomes a new preset.
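The same adjust-and-export workflow can be approximated outside the dashboard. The sketch below assumes a preset is a flat JSON object whose top-level keys are constraint fields such as max_position_pct and max_single_trade_pct; that layout is an assumption, and save_preset_variant is a hypothetical helper, not part of the package. The demo runs against a stand-in file in a temporary directory rather than the real presets.

```python
import json
import tempfile
from pathlib import Path


def save_preset_variant(base_path: Path, out_path: Path, **overrides) -> dict:
    """Copy a preset, override selected fields, and write the result as JSON."""
    base = json.loads(Path(base_path).read_text())
    unknown = sorted(set(overrides) - set(base))
    if unknown:
        # Refuse to add fields the base preset does not define
        raise KeyError(f"fields not in base preset: {unknown}")
    variant = {**base, **overrides}
    Path(out_path).write_text(json.dumps(variant, indent=2))
    return variant


# Demo against a stand-in preset written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    base_file = Path(tmp) / "conservative.json"
    base_file.write_text(json.dumps({"max_position_pct": 0.10,
                                     "max_single_trade_pct": 0.03}))
    out_file = Path(tmp) / "conservative_tight.json"
    variant = save_preset_variant(base_file, out_file, max_single_trade_pct=0.02)
    reloaded = json.loads(out_file.read_text())

print(variant)
```

Rejecting unknown keys is a deliberate choice in this sketch: a typo in a field name would otherwise produce a preset that silently keeps the old limit.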

Running the backtesting engine

The BacktestRunner replays the CoachLoop over historical trading days via yfinance. It is included in the package — check the repository for the current API, as the interface may evolve:

from player_coach.backtest.runner import BacktestRunner

runner = BacktestRunner(
    player=player,
    coach=coach,
    constraints=constraints,
    db_store=store,
)

result = runner.run(
    symbol="AMZN",
    start_date="2025-01-01",
    end_date="2025-12-31",
    initial_capital=100_000.0,
)

print(f"Total return: {result.total_return_pct:.1f}%")
print(f"Days traded: {result.days_traded}")
print(f"Days aborted: {result.days_aborted}")
print(f"Approval rate: {result.approval_rate:.1%}")

The dashboard

Above: player coach demo

The dashboard runs four pages. Trade Review runs a live exchange — market parameters and constraint preset in the sidebar, Player and Coach characters in the main area with streaming speech bubbles, round cards below recording each verdict, full artifact JSON at the bottom. Constraints lets you configure and export the schema. History shows all past exchanges from SQLite with filtering and replay. Settings handles BYOK API key entry — key lives in session memory only, never stored.

pip install player-coach-core[dashboard]
streamlit run dashboard/app.py

Or use the live deployment at [STREAMLIT_URL].

The characters are the part that is easy to dismiss as decoration. They are not. Watching the Player’s reasoning stream in real time, then the Coach’s constraint checks appear one by one, then the round card collapse with the verdict — that sequence makes the adversarial structure legible in a way that a static artifact does not. The exchange is visible. The challenge is visible. The constraint that blocked the first proposal and what the Player changed to satisfy it is visible.

That legibility is what makes the system usable, not just functional.

What to do with the accumulated evidence

Each completed exchange writes to five SQLite tables: exchanges, rounds, strategies, portfolio snapshots, and coach memory. The coach memory table accumulates observations across runs — which symbols, which patterns, which constraint combinations.
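To give a feel for what that evidence supports, here is a query sketch against an in-memory SQLite database. The column layout is entirely hypothetical: the essay names the tables but not their columns, so this example invents a simplified rounds table with one row per named violation. It should not be read as the package's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified layout: one row per (round, named violation)
conn.execute(
    "CREATE TABLE rounds (exchange_id TEXT, round INTEGER, decision TEXT, violation TEXT)"
)
conn.executemany(
    "INSERT INTO rounds VALUES (?, ?, ?, ?)",
    [
        ("a1", 1, "REJECT", "min_risk_reward"),
        ("a1", 2, "APPROVE", None),
        ("b2", 1, "REJECT", "min_risk_reward"),
        ("b2", 2, "REJECT", "max_single_trade_pct"),
        ("b2", 3, "APPROVE", None),
    ],
)


def violation_counts(connection: sqlite3.Connection) -> list:
    """Count how often each constraint was named in a rejection, most frequent first."""
    return connection.execute(
        "SELECT violation, COUNT(*) AS n FROM rounds "
        "WHERE violation IS NOT NULL "
        "GROUP BY violation ORDER BY n DESC, violation ASC"
    ).fetchall()


print(violation_counts(conn))
```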

The ConstraintDeriver reads that accumulated evidence and derives a new schema:

from player_coach.constraints.deriver import ConstraintDeriver

deriver = ConstraintDeriver(db_store=store)
derived = deriver.derive(strategy_id="conservative-amzn")

print(derived.min_risk_reward)  # derived from 25th percentile of historical R/R
print(derived.allowed_symbols)  # symbols with confidence >= 0.6 in coach memory

The derived schema is tighter where evidence says constraints are routinely violated, and calibrated where evidence says the current limits are well inside safe ranges. You can review the derived schema, adjust it manually, and save it as a new preset. The evidence informs. The decision is still yours.
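The two derivation rules mentioned in the code comments, a 25th-percentile risk/reward floor and a 0.6 confidence threshold for symbols, can be illustrated with the standard library. This is a sketch of the idea only: the function names and sample data are hypothetical, and the ConstraintDeriver may compute its percentile with a different interpolation method than statistics.quantiles uses by default.

```python
from statistics import quantiles


def derive_min_risk_reward(historical_rr: list) -> float:
    """Return the 25th percentile (first quartile) of observed risk/reward ratios."""
    if len(historical_rr) < 2:
        raise ValueError("need at least two observations to compute quartiles")
    # quantiles(..., n=4) returns the three quartile cut points; index 0 is Q1
    return quantiles(historical_rr, n=4)[0]


def derive_allowed_symbols(confidence: dict, threshold: float = 0.6) -> list:
    """Keep symbols whose confidence score meets the threshold."""
    return sorted(symbol for symbol, score in confidence.items() if score >= threshold)


# Made-up evidence for illustration
observed_rr = [1.5, 2.0, 2.5, 3.0, 3.5]
symbol_confidence = {"AMZN": 0.82, "MSFT": 0.60, "TSLA": 0.41}

derived_rr = derive_min_risk_reward(observed_rr)
derived_symbols = derive_allowed_symbols(symbol_confidence)
print(derived_rr, derived_symbols)
```

Using a low percentile rather than the mean means the derived floor reflects what most historical trades cleared, without being pulled upward by a few unusually favourable setups.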

The complete picture

The Executable World Models series established the infrastructure: trajectories, artifacts, evaluation, experiments. Essay 6 (“Decisions That Don’t Disappear”) placed Claude in the agent slot and confirmed the infrastructure holds. “The Adversarial Quality Loop” described the structure and why it creates quality. This essay showed the implementation.

The player-coach system is not a recommendation engine. It does not tell you what to trade. It tells you whether a proposed trade satisfies the constraints you set — mechanically, with named violations, round by round — and records the full exchange so you know exactly what was challenged, what failed, and what finally passed.

The quality is not in the agents. The quality is in what surrounds them.

The code is at github.com/MaverickHQ/crucible-player-coach.

Install: pip install player-coach-core[llm] for agents, pip install player-coach-core[dashboard] for the full dashboard.
Live dashboard: [STREAMLIT_URL]

This is Essay 9 of the Executable World Models series.

Research Essays with Code
