Self-Improving Agent Harnesses with Judgment CLI

We gave GPT-5-nano the Judgment Agent and let it build a coding-agent harness from scratch.

Sujan Rachuri, Kian Kyers, Andrew Li·June 17, 2026·9 min read

Agent performance is not just about the model. The harness around the model, including prompts, tools, memory, retries, context management, and control flow, often determines whether an agent can complete long-horizon tasks.

But harnesses are still tuned manually. Developers run evals, read failed trajectories, guess what went wrong, patch the harness, redeploy, and then discover a new set of regressions. That loop works for small systems, but it breaks once failures are too long and too numerous for humans to inspect directly.

We wanted to know what would happen if the harness could improve itself. We gave GPT-5-nano access to Judgment Agent through our CLI for evals and trace analysis, Modal for sandboxing, and an empty agent.py file. Then we let it run unsupervised to build a coding-agent harness from scratch.

By the end, the loop had consumed ~200B tokens and spun up 6K containers, and the harness it produced outperformed mini-swe-agent on SWE-Bench-Verified.

This post explains how we constructed the self-improvement loop, what worked, what failed, and what the surviving harness looks like.

The Harness Engineering Problem

Harness engineering means changing the scaffolding around a model: the tool interface, memory, context management, retry logic, and control flow that determine how the model acts over time.

In practice, it often feels like prompt engineering moved into code. Instead of changing a sentence in the system prompt and hoping the eval improves, you change a tool schema, add a planner, tweak a retry rule, introduce a subagent, or modify how context gets summarized. Sometimes the score goes up. Sometimes it gets worse. Often, one failure mode disappears and another takes its place.

That kind of manual loop can work when trajectories are short enough for humans to inspect directly. But for long-horizon agents, a single run can contain thousands of tool calls, edits, logs, tests, partial plans, and abandoned hypotheses. The final eval tells you whether the agent succeeded, but not what part of the harness caused it to fail.

Systems like DSPy and GEPA showed that prompt optimization does not have to be done entirely by hand. But prompts are only one part of the agent. If the harness determines what the model can see, what it can do, how it remembers, and how it recovers, then the natural next question is: why stop at the prompt?

Data Layer

To make the harness improve itself, the agent needed three abilities: edit the harness, run it against real tasks, and understand why it failed.

We started with an empty agent.py file. GPT-5-nano was tasked with turning that file into a working coding-agent harness. In each iteration, it proposed candidate harness variants by editing the code directly, then branched from the strongest variants in previous iterations.

Those candidates were run in isolated Modal sandboxes against 820 software engineering and coding tasks drawn from SWE-Bench-Pro and Terminal-Bench 2. Each run produced a full trajectory: prompts, tool calls, terminal output, code edits, errors, retries, and final outcomes. Each search run lasted 17 iterations, and the surviving harness was then validated on SWE-Bench-Verified as a held-out set.
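As a rough illustration of the propose, evaluate, and branch cycle described above, here is a minimal sketch. It is not the code used in the experiment: the names Candidate, search, propose, and evaluate are hypothetical, and the branch_factor and survivors parameters are assumptions. Only the empty starting file and the 17 iterations come from the setup above. In the real loop, propose would be GPT-5-nano editing agent.py, and evaluate would run the candidate against the task set in Modal sandboxes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    """One harness variant; `source` stands in for the contents of agent.py."""
    name: str
    source: str
    score: float = 0.0


def search(
    propose: Callable[[Candidate, int], str],
    evaluate: Callable[[str], float],
    iterations: int = 17,      # from the post: 17 iterations per search run
    branch_factor: int = 3,    # assumption: variants proposed per parent
    survivors: int = 2,        # assumption: parents kept per iteration
) -> Candidate:
    # Start from an empty agent.py, as in the experiment.
    pool = [Candidate("gen0-a0", "", evaluate(""))]
    for gen in range(1, iterations + 1):
        children = []
        for parent in pool:
            for k in range(branch_factor):
                source = propose(parent, k)
                name = f"gen{gen}-a{len(children)}"
                children.append(Candidate(name, source, evaluate(source)))
        # Parents compete with their children, so a regression
        # never displaces a stronger earlier variant.
        ranked = sorted(pool + children, key=lambda c: c.score, reverse=True)
        pool = ranked[:survivors]
    return pool[0]


# Toy stand-ins so the sketch runs without a model or a sandbox.
def toy_propose(parent: Candidate, k: int) -> str:
    return parent.source + "x" * k


def toy_evaluate(source: str) -> float:
    return min(len(source), 10) / 10


best = search(toy_propose, toy_evaluate, iterations=2)
print(best.name, repr(best.source), best.score)
```

Keeping parents in the ranking is a design choice of this sketch; the post only says that new candidates branched from the strongest earlier variants.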

The hard part was the third ability: understanding why a harness failed. Pass/fail scores were too thin. A single failed trajectory could contain thousands of tool calls, and the actual cause could be any of several things buried inside them: repeated actions, malformed tool calls, context loss, stalled recovery attempts, premature termination, or partial fixes that looked plausible but did not solve the task.

To give GPT-5-nano that feedback, we connected it to:

Judgment CLI: access to eval results, traces, and trajectory search. This gave the model visibility into what happened during failed runs.
Judgment Agent: an analysis agent connected to the same trace and eval data. Its job was to inspect large volumes of runs, surface recurring behavioral patterns, and suggest changes back to GPT-5-nano.
Advisor: a stronger model that GPT-5-nano could consult when it decided a situation was worth escalating. The advisor did not take over the loop. GPT-5-nano chose when to ask, what context to share, and whether to use the answer.

Figure: GPT-5-nano works through the Judgment CLI for visibility into runs; the Judgment Agent reads the same trace and eval data to surface recurring patterns; and GPT-5-nano escalates to the Advisor when it chooses.

Without trace visibility, the agent drifted off-task more often. Its only signal was outcome-based, so it tended to make shallow fixes after seeing that a run failed, without understanding the trajectory that led there. Task decomposition was rare. The agent typically planned once or twice before committing to a fix.

With Judgment and the advisor in the loop, trajectories became noticeably more coherent. The agent could inspect failed trajectories, retrieve similar failures, see recurring behavioral patterns, ask for help on difficult decisions, and turn those findings into the next code change. It planned ~45% more often, analyzed failures more methodically, and stayed on-task for longer stretches before committing to an edit.

The Best-Scoring Harness Was Not Necessarily the Best Harness

Each generation tested multiple harness candidates. Gen2 a2 means candidate a2 from the second generation of the harness search.

In Gen2, the eval did not produce a simple winner. Candidate a2 had the best raw score, but candidate a4 looked stronger when compared on the same shared tasks. A naive loop would have promoted a2 because it was highest on the leaderboard. Judgment surfaced a more careful interpretation: keep both candidates, replay them, and test a hybrid.

Judgment Agent reasoning:

Gen2 is partial but selection-useful.

Raw leader:
  a2 = 66/820
  - concise prompt
  - 60-turn budget
  - low error pressure

Shared-task leader:
  a4
  - context-budget prompt
  - 45-turn budget
  - beats a2 on the five-way shared set: 38 vs 37
  - beats a2 on the a2/a4/a0 shared set: 48 vs 46

Decision:
  Do not overfit to one draw.
  Keep both candidates.
  Replay the raw-score winner.
  Replay the shared-task winner.
  Create a hybrid that combines both signals.

That analysis changed what GPT-5-nano wrote next:

# Gen3 a0: replay the raw-score winner
AutoAgent = install_prompt(A6_PROMPT, max_turns=60)

# Gen3 a1: replay the shared-task winner
AutoAgent = install_prompt(A6_BUDGET_PROMPT, max_turns=45)

# Gen3 a2: cross-breed both signals
# a4's context-budget prompt + a2's 60-turn budget
AutoAgent = install_prompt(A6_BUDGET_PROMPT, max_turns=60)

The important part is not the prompt name or the turn count. The important part is that the loop learned not to trust the raw leaderboard alone. Judgment could compare earlier runs against each other on equal footing and use that comparison to decide which changes to try next.
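The difference between a raw leader and a shared-task leader can be sketched in a few lines. This is an illustration, not Judgment's implementation: the function names and the toy data are hypothetical, and it assumes that candidates in a partial generation finished different subsets of tasks, so a fair comparison restricts scoring to the tasks every candidate completed.

```python
from typing import Dict

# Per-candidate results: task id -> whether the candidate solved it.
Results = Dict[str, bool]


def raw_scores(runs: Dict[str, Results]) -> Dict[str, int]:
    """Solved-task count over whatever tasks each candidate happened to finish."""
    return {name: sum(results.values()) for name, results in runs.items()}


def shared_task_scores(runs: Dict[str, Results]) -> Dict[str, int]:
    """Solved-task count restricted to tasks that every candidate finished."""
    if not runs:
        return {}
    shared = set.intersection(*(set(results) for results in runs.values()))
    return {
        name: sum(results[task] for task in shared)
        for name, results in runs.items()
    }


# Toy data (hypothetical): a2 finished more tasks, a4 finished fewer.
runs = {
    "a2": {"t1": True, "t2": False, "t3": True,
           "t4": True, "t5": True, "t6": True},
    "a4": {"t1": True, "t2": True, "t3": True},
}

print(raw_scores(runs))          # {'a2': 5, 'a4': 3}
print(shared_task_scores(runs))  # {'a2': 2, 'a4': 3}
```

In the toy data, a2 leads on raw count only because it finished more tasks. On the three tasks both candidates completed, a4 is ahead, which is the same shape of disagreement that led the loop to keep both candidates and test a hybrid.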

Self-Improvement over SWE-Bench-Verified

We evaluated the surviving harness on SWE-Bench-Verified and compared it against the public mini-swe-agent GPT-5-nano baseline.

Harness	SWE-Bench-Verified
Surviving harness	46%
mini-swe-agent + GPT-5-nano	34.8%

The surviving harness differed from mini-swe-agent in three main ways:

Narrow evidence collection. Instead of broad terminal dumps, the harness favored targeted search, bounded file slices, and focused test output. This kept context lean while preserving the error messages, diffs, and verifier outputs most likely to matter.
Recovery under context pressure. When a run stalled or exceeded the context window, the harness could compact the state and retry with a shorter, more focused prompt. A hard context failure became a constrained second attempt rather than a dropped trajectory.
Failure-aware replanning. Before repeating the same action or terminating early, the harness could retrieve similar prior failures and check whether the current trajectory had made meaningful progress.
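The second behavior above, recovery under context pressure, might look roughly like the following. This is a sketch under stated assumptions rather than the surviving harness's code: ContextOverflow, compact, and run_with_recovery are hypothetical names, history is modeled as a plain list of strings whose first entry is the task statement, and the compaction rule (keep the task, keep the tail of the most recent entries) is one simple choice among many.

```python
from typing import Callable, List, Optional


class ContextOverflow(Exception):
    """Hypothetical error for a prompt that exceeds the model's context window."""


def compact(history: List[str], keep_last: int = 4, max_chars: int = 200) -> List[str]:
    """Keep the task statement plus the tail of the most recent entries.

    Tails are kept because error messages and test summaries usually
    appear at the end of terminal output. keep_last must be >= 1.
    """
    head = history[:1]
    recent = history[1:][-keep_last:]
    return head + [entry[-max_chars:] for entry in recent]


def run_with_recovery(
    step: Callable[[List[str]], str],
    history: List[str],
    max_attempts: int = 2,
) -> Optional[str]:
    """Retry a step on a compacted history instead of dropping the trajectory."""
    for _ in range(max_attempts):
        try:
            return step(history)
        except ContextOverflow:
            history = compact(history)
    return None  # every attempt overflowed


# Toy model call (hypothetical): fails when the prompt is over 1000 characters.
calls = []


def toy_step(history: List[str]) -> str:
    calls.append(sum(len(entry) for entry in history))
    if calls[-1] > 1000:
        raise ContextOverflow
    return "ok"


history = ["fix the failing test"] + ["x" * 500] * 10
result = run_with_recovery(toy_step, history)
print(result, calls)
```

The first attempt overflows on the full history; the second runs on the compacted one. That mirrors the description above: a hard context failure turns into a constrained second attempt.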

What's Next

This experiment worked, but it was painfully expensive: hundreds of billions of tokens, thousands of containers, and many rounds of trace analysis to produce one stronger harness. To make this loop practical, the next step is building infrastructure that lets agents search across billions of production trajectories, find the few behaviors that matter, reason over why they happened, and turn that experience back into better AI systems.
