What we learned building the harness around our coding agents

What we built, what changed, and what we would do differently in the harness around Claude Code and Codex, organized into seven parts: context, the context graph, workflow, restraint, empowerment, verification, and visual interface.

Karl Wirth

Teams building with AI usually end up building two products: the thing they ship, and the system around their agents that makes them useful on real work.

We built that system to help us build Nimbalyst. This post is about what we learned from doing it.

A second harness above Claude Code and Codex: prompt, seven-part harness, agent harness, and the underlying model

What a harness is

A harness is the durable layer around a model: instructions, tools, permissions, context, and verification.

Claude Code and Codex are harnesses in this sense. Each wraps a model with a system prompt, a tool surface, a permission model, and an execution loop. Anthropic and OpenAI own that layer.

Your team owns the next layer up: the workspace where agents do product work alongside you, with your files, tasks, diagrams, diffs, and decisions. This layer carries the knowledge your team has accumulated: how you build things, what you already decided, what is connected to what, where the agent is allowed to act, and how it checks its own work.

The line between context and harness can blur. A ticket or spec is task-specific context, but the mechanism that makes that ticket searchable, linkable, versioned, and retrievable by any agent is part of the harness.

Almost nothing in a good harness is novel. It is mostly other people’s parts assembled around your project: Claude Code, Codex, MCP, Playwright, a tracker, a diagramming tool, an editor, a test runner, your repository, your docs. The harness is the way those pieces are put together so an agent can pull the right context for a task and verify what it produced.

We think about our harness in seven parts: context, context graph, workflow, restraint, empowerment, verification, and visual interface.

1. Context

Context is everything specific to your project: code, specs, design docs, tracker items, data models, past decisions, conventions, examples, and recipes.

In our harness that means:

  • Code, specs, plans, and mockups live as local files in formats an agent can read and edit directly.
  • Architecture diagrams live as Excalidraw files instead of screenshots trapped in a slide deck.
  • Decisions are captured as tracker items, not buried in chat transcripts.
  • Bug histories are searchable, so the agent can see symptoms, root cause, and previous fixes.
  • Root instruction files like CLAUDE.md and AGENTS.md load at session start and point the agent at the rest.
  • Path-scoped rule files load only when the agent touches a relevant directory, so React rules show up for renderer code and Swift rules show up for the iOS package.
  • A skill system holds reusable instructions for recurring jobs: how we write tests, add analytics events, release a package, or debug a failing screen.
  • Data models, fixtures, and recipes live in source control instead of tribal memory.
  • Context files can import shared guidance instead of duplicating it across ten drifting copies.
Context pillar: CLAUDE.md, rule files, examples, data model, skills

For example, an agent editing renderer code loads React rules without loading iOS rules, and an agent fixing a regression can find the prior bug, root cause, and fix before writing code.
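
A minimal sketch of path-scoped rule loading, to make the idea concrete. The rule table, directory names, and the context_files_for helper are assumptions made up for illustration; they are not how Claude Code, Codex, or Nimbalyst resolve rule files.

```python
from fnmatch import fnmatchcase

# Root instruction files load in every session.
ROOT_FILES = ["CLAUDE.md"]

# Hypothetical rule table: glob pattern -> rule file to load.
# Paths and file names are illustrative, not a real project layout.
PATH_RULES = {
    "packages/renderer/*": "rules/react.md",
    "packages/ios/*": "rules/swift.md",
    "*.test.ts": "rules/testing.md",
}


def context_files_for(touched_paths):
    """Return root files plus every rule file whose pattern matches a touched path."""
    selected = list(ROOT_FILES)
    for pattern, rule_file in PATH_RULES.items():
        if any(fnmatchcase(path, pattern) for path in touched_paths):
            selected.append(rule_file)
    return selected


print(context_files_for(["packages/renderer/src/Editor.tsx"]))
# prints ['CLAUDE.md', 'rules/react.md']
```

The point of the design is that the Swift rules never enter a renderer session, so the context window is spent only on guidance that applies to the files being edited.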

Each session starts with the team’s accumulated decisions already in scope instead of being re-derived from the prompt.

2. Context graph

Before we had this pillar, our work lived as a pile of tabs. Tickets were in one tool, planning docs in another, design diagrams somewhere else, diffs in the IDE, and working sessions in Claude Code or Codex. The connections between those artifacts lived in our heads, so the agent could not traverse relationships that were never recorded.

In our harness that means:

  • A persistent, typed link graph between tracker items, plans, specs, diagrams, mockups, sessions, diffs, files, commits, and decisions.
  • First-class editors for those artifacts inside the same workspace, so links resolve to actual working content.
  • A graph that supports useful traversals, not just backlinks.
  • An MCP surface so different agents can traverse the same graph during a session.
  • A durable chain from “why are we doing this?” to “what changed?” and “what happened after?”
Context graph pillar: tracker, sessions and files, commits and tasks, decision log, memory graph

A bug can link to the failing screenshot, the fixing session, the diff, and the commit. A feature request can link to the plan, the mockup, the implementation sessions, and the release note. Instead of copy-pasting six links into a prompt, the agent can follow the same chain you would.
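
One way to picture a typed link graph is a small in-memory structure like the sketch below. The ContextGraph class, the link type names, and the artifact ids are all hypothetical; a real graph would be persistent and exposed through tools rather than held in a dictionary.

```python
from collections import deque


class ContextGraph:
    """Tiny in-memory graph of typed, directed links between artifacts."""

    def __init__(self):
        self._edges = {}  # source id -> list of (link type, target id)

    def link(self, source, link_type, target):
        self._edges.setdefault(source, []).append((link_type, target))

    def traverse(self, start, allowed_types=None, max_depth=3):
        """Breadth-first walk from start, optionally restricted to some link types."""
        seen = {start}
        found = []
        queue = deque([(start, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == max_depth:
                continue  # do not expand beyond the depth limit
            for link_type, target in self._edges.get(node, []):
                if allowed_types is not None and link_type not in allowed_types:
                    continue
                if target not in seen:
                    seen.add(target)
                    found.append((link_type, target))
                    queue.append((target, depth + 1))
        return found


# Illustrative ids only; none of these refer to real artifacts.
graph = ContextGraph()
graph.link("bug:sync-load", "evidenced_by", "screenshot:blank-body")
graph.link("bug:sync-load", "fixed_in", "session:fix-sync")
graph.link("session:fix-sync", "produced", "diff:sync-loader")
graph.link("diff:sync-loader", "landed_as", "commit:example")

for link_type, target in graph.traverse("bug:sync-load"):
    print(link_type, "->", target)
```

Because links carry a type, a traversal can ask for "everything that fixed this bug" rather than "everything that mentions it", which is the difference between a useful traversal and a plain backlink list.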

3. Workflow

Workflow is the shape of a coding session: how it starts, how it plans, how it gets help, and how it parallelizes. Without a workflow layer, every session improvises its own arc, and the agent has to be told the basics again each time.

In our harness that means:

  • Repo-local slash commands in .claude/commands/ for the steps we run over and over: plan, implement, review, release.
  • A standard plan-then-execute arc for non-trivial work, so the agent commits to an approach before changing files.
  • Subagents for exploration, planning, and implementation that take on broad searches and keep the main session’s context small.
  • A skill system for reusable habits like writing tests, adding analytics events, or releasing a package.
  • Git worktrees and sibling sessions so multiple agents can work on isolated checkouts of the same repo without stepping on each other.
Workflow pillar: slash commands, plan and execute, subagents, skills, worktrees

A /release-alpha command can run the version-bump, changelog, and tag steps the same way every time, and a worktree can let one agent fix a bug on main while another finishes a feature branch in parallel. A workflow layer keeps each session from reinventing itself.
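
The worktree half of this can be sketched as a small helper that builds the git commands for sibling sessions. The directory naming scheme, the repo name, and the worktree_command helper are assumptions for illustration; only `git worktree add -b <branch> <path> <commit-ish>` is standard git. The sketch builds the command without running it, so it is safe to try anywhere.

```python
import shlex
from pathlib import PurePosixPath


def worktree_command(repo_name, branch, base="main", root=".."):
    """Build a `git worktree add` command that creates `branch` from `base`
    in its own sibling directory, so each agent gets an isolated checkout."""
    directory = PurePosixPath(root) / f"{repo_name}-{branch.replace('/', '-')}"
    return ["git", "worktree", "add", "-b", branch, str(directory), base]


# Two hypothetical sibling sessions working on the same repository.
for branch in ["fix/sync-load", "feature/export"]:
    print(shlex.join(worktree_command("app", branch)))
# prints:
# git worktree add -b fix/sync-load ../app-fix-sync-load main
# git worktree add -b feature/export ../app-feature-export main
```

To actually create the worktrees, each command list could be passed to subprocess.run(command, check=True) from inside the repository.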

4. Restraint

Restraint is how you stop an agent from doing the wrong thing quickly. It covers hard rules, approval boundaries, permission scopes, tool allowlists, budget limits, and an audit trail.

In our harness that means:

  • Path-scoped rules that block agents from editing specific files or directories.
  • Hard rules in instruction files for things the agent must never do, like reading .env files or touching credentials.
  • Per-tool permission scopes and allowlists.
  • Approval flows for actions that touch shared or costly state: push to main, drop a table, hit a paid API, or run a destructive shell command.
  • Workspace trust modes that separate “can edit files” from “can do anything.”
  • A durable audit trail of approvals, tool calls, and file changes.
  • Review surfaces that make it obvious what the agent actually changed.
Restraint pillar: path-scoped rules, hard rules, permission scopes, tool allowlists, audit trail

In practice, that means letting an agent refactor renderer code but not release scripts, query a development database but not production, and spend tokens on test loops without touching paid third-party APIs unchecked. A capable agent without restraint eventually does something expensive, destructive, or embarrassing faster than you expected.
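
A minimal sketch of how these layers might compose into one decision point, assuming a hypothetical Gate class. The tool names, path patterns, and verdict strings are invented for illustration and do not correspond to the permission model of any particular agent.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase

# Hypothetical policy values; patterns and tool names are illustrative only.
DENIED_PATHS = ["*.env", "*.pem", "scripts/release/*"]
NEEDS_APPROVAL = {"git_push_main", "drop_table", "paid_api_call"}
ALLOWED_TOOLS = {"edit_file", "run_tests", "read_logs"} | NEEDS_APPROVAL


@dataclass
class Gate:
    """Checks a requested tool call against the policy and records every decision."""

    audit_log: list = field(default_factory=list)

    def decide(self, tool, path=None, approved=False):
        """Return 'deny', 'needs_approval', or 'allow'."""
        if tool not in ALLOWED_TOOLS:
            verdict = "deny"  # tool allowlist
        elif path is not None and any(fnmatchcase(path, p) for p in DENIED_PATHS):
            verdict = "deny"  # path-scoped hard rule
        elif tool in NEEDS_APPROVAL and not approved:
            verdict = "needs_approval"  # human approval boundary
        else:
            verdict = "allow"
        self.audit_log.append({"tool": tool, "path": path, "verdict": verdict})
        return verdict


gate = Gate()
print(gate.decide("edit_file", "packages/renderer/App.tsx"))   # prints allow
print(gate.decide("edit_file", "scripts/release/publish.sh"))  # prints deny
print(gate.decide("git_push_main"))                            # prints needs_approval
```

Denials are checked before approvals on purpose: an approval should never be able to unlock something a hard rule forbids, and every verdict, including the denied ones, lands in the audit log.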

5. Empowerment

Empowerment is the counterpart to restraint: more reach for the agent in the places where it is safe and useful. It covers tools that let an agent touch live state and verify what it did: reading logs, querying a running database, driving the UI, taking screenshots, running tests, and looping until the result is correct.

In our harness that means:

  • MCP tools that read live application logs and query the running database through the app instead of unsafe direct access.
  • Tools that expose rendered state when text output is not enough.
  • A Playwright-driven UI loop so an agent can interact with the running app, take a screenshot, and verify the result.
  • MCP tools that wrap third-party systems the agent uses every day: GitHub, the analytics dashboard, the browser, the tracker.
  • A sandboxed shell so the agent can run tests, scripts, and safe codemods inside the workspace.
  • An extension SDK so teams can write their own MCP tools and ship them inside the workspace.
Empowerment pillar: logs and DB queries, UI and vision, Playwright loop, MCP tools, sandbox and shell

After changing a React component, the agent can open the screen and check a screenshot. After changing persistence logic, it can verify that the row actually changed. An agent that can inspect the actual result can often close its own loop.
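
As one illustration of "query the running database without unsafe direct access", here is a sketch of a read-only query tool built on SQLite. The make_query_tool and write_is_blocked helpers and the documents table are hypothetical; the only real mechanism relied on is SQLite's `PRAGMA query_only`, which rejects writes on the connection it is set on. A real tool would sit behind the application and an MCP server rather than share a connection like this.

```python
import sqlite3


def make_query_tool(conn, max_rows=50):
    """Return a query function for the agent: writes disabled, results capped."""
    conn.execute("PRAGMA query_only = ON")  # SQLite refuses writes on this connection

    def query(sql, params=()):
        return conn.execute(sql, params).fetchmany(max_rows)

    return query


def write_is_blocked(query, sql):
    """Return True if the statement is rejected by the read-only connection."""
    try:
        query(sql)
    except sqlite3.Error:
        return True
    return False


# Illustrative stand-in for the running app's database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO documents (title) VALUES ('Draft')")
conn.commit()

query = make_query_tool(conn)
print(query("SELECT id, title FROM documents"))  # prints [(1, 'Draft')]
print(write_is_blocked(query, "UPDATE documents SET title = 'Final'"))  # prints True
```

The agent can confirm that a row changed after its code ran, but it cannot change the row itself through the inspection tool, which keeps verification separate from mutation.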

6. Verification

Verification is how an agent proves a change works before handing it back. It covers tests, type checks, fail-first reproduction of bugs, and simulated runs of the agent’s own tool calls.

In our harness that means:

  • A Vitest unit suite that runs across packages and gives fast feedback on logic-level changes.
  • Playwright end-to-end tests for real flows, one spec per run so a failure points at one place.
  • A fail-first discipline: write the failing test before writing the fix, so the bug has a reproduction the next agent can rerun.
  • An AI tool simulator that lets E2E specs fake AI tool calls and assert on what the agent did, without paying for a real model.
  • Fast type checks baked into the loop so the agent catches drift before tests even run.
Verification pillar: unit tests, E2E, fail-first, AI tool simulator, type checks

For example:

  • A fix for a sync bug starts with a Playwright spec that opens the broken document and asserts the body loads, then the agent fixes the code until that spec turns green.
  • A renderer change runs the unit suite and the type check before the agent claims it is done.

If the agent cannot show the change works end-to-end, it is not done.
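
The fail-first discipline can be sketched as a small check that refuses a fix unless the reproduction fails on the old code and passes on the new code. The fail_first helper and the load_body example are hypothetical stand-ins; in practice the reproduction would be a Vitest or Playwright spec rather than a Python function.

```python
def fail_first(repro, before_fix, after_fix):
    """Accept a fix only if the repro fails on the old code and passes on the new code."""

    def passes(implementation):
        try:
            repro(implementation)
            return True
        except AssertionError:
            return False

    if passes(before_fix):
        return "rejected: repro passes on old code"  # the test does not capture the bug
    if not passes(after_fix):
        return "rejected: repro fails on new code"   # the fix does not work
    return "accepted"


# Hypothetical bug: documents saved under a legacy "content" key load with an empty body.
def load_body_old(doc):
    return doc.get("body", "")


def load_body_new(doc):
    return doc.get("body") or doc.get("content", "")


def repro(load_body):
    assert load_body({"content": "hello"}) == "hello"


print(fail_first(repro, load_body_old, load_body_new))  # prints accepted
```

The first rejection branch is the one that matters most: a test that already passes before the fix proves nothing about the bug, however green it looks afterward.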

7. Visual interface

A lot of software work is visual. Markdown review, UI mockups, architecture diagrams, data models, diffs, screenshots, and sketches are part of the task input, not presentation garnish, so they are part of the harness too.

In our harness that means:

  • A workspace where the mockup, the diff, and the tracker item sit side by side, so the thing being reviewed and the place the agent works are the same place.
  • A markdown editor with red/green diffs for agent edits, plus diff review across every file the agent touched in a session.
  • Mockup, diagram, and data-model editors as first-class file types, with image and screenshot inputs the agent can read and produce directly.
  • Approval gates on risky actions like merges, deploys, and pushes to main, with the diff and linked tickets shown in one view before approval.
  • Threaded discussions tied to tracker items, diffs, and decisions, so the reasoning lives next to the artifact instead of vanishing into a chat tool.
  • Team handoff posts when a session ends, summarizing what shipped, what is still in flight, and what risks remain.
Visual interface pillar: visual workspace, diffs and review, approvals, team handoffs, discussions

The agent can edit a mockup, render it, compare the screenshot to the request, and then review a red/green diff before merge in the same workspace. A visual workspace keeps decisions attached to artifacts instead of burying them in chat.

What changed once our harness covered all seven parts

Once we had these seven pieces in place, a few things changed:

  • Sessions resumed from prior context without re-prompting the same background every time.
  • A single prompt could pull in the linked plan, prior session, spec, and affected files through one graph traversal.
  • We could switch the same task between Claude Code and Codex without rebuilding the workflow above them.
  • Permission scopes and the audit trail made it practical to let agents run through multi-step work and review after the fact.
  • Agents could verify their own UI and backend changes through screenshots, log queries, and test loops before asking for review.
  • The useful parts of a session stopped disappearing with the chat window because the decisions, links, and artifacts remained in the workspace.

We regularly review transcripts for repeated mistakes and feed the patterns back into rules, linked context, and CLAUDE.md, so the next session does not relearn the same lesson. Decisions made during a session land in the tracker. New skills get written the moment we notice ourselves explaining the same convention twice. The harness gets better every week without anyone setting aside a “harness sprint.”
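
The transcript review step can be as simple as counting which corrections recur across sessions. The sketch below assumes a reviewer has already tagged each session with short correction notes; the session names, the notes, and the rule_candidates helper are all made up for illustration.

```python
from collections import Counter


def rule_candidates(sessions, min_sessions=2):
    """Return correction notes that recur in at least `min_sessions` distinct sessions."""
    counts = Counter()
    for corrections in sessions.values():
        counts.update(set(corrections))  # count each note once per session
    return sorted(note for note, n in counts.items() if n >= min_sessions)


# Hypothetical review notes, one list per session.
sessions = {
    "session-1": ["imported from dist instead of src", "skipped the type check"],
    "session-2": ["skipped the type check"],
    "session-3": ["imported from dist instead of src", "edited a generated file"],
}

for note in rule_candidates(sessions):
    print("candidate rule:", note)
# prints:
# candidate rule: imported from dist instead of src
# candidate rule: skipped the type check
```

A correction that shows up once is noise; one that shows up in several sessions is a missing rule, skill, or linked piece of context.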

Here is what the seven parts look like filled in for a single concrete prompt:

A harness in action: a prompt, the harness parts populated with project-specific entries, the agent harness, the model, and the resulting outcome

Recommendations from our experience

If you are building your first harness this week

Do the boring parts first. They compound the fastest.

  • Put your specs, plans, diagrams, and checklists in files the agent can read directly.
  • Add one root instruction file and a small number of path-scoped rules.
  • Give the agent at least three verification tools: logs, tests, and browser or screenshot access.
  • Add approval gates for destructive, expensive, or shared-state actions.
  • Link tickets, docs, files, sessions, and commits so future runs can traverse prior work.

Doing those five things already moves you from “chatting with a model” to “operating a system that gets better over time.”

If you already have a harness, invest in it

Treat the harness as a product your team ships to itself.

A meaningful share of your AI effort should go into improving the system around the model, not just consuming completions from the model. That means writing better rules, wiring up better MCP tools, recording better decisions, adding better examples, and tightening the verification loop.

Pick a percentage of your AI effort that goes to the harness instead of feature work, protect it, and make sure every release cycle includes at least one improvement to one of the seven pillars.

Every rule, tool, example, and link makes future sessions cheaper and better.

Own your harness

You should own your harness: instructions, rules, tool definitions, links between work items, audit logs, and reusable skills.

If you cannot read it, edit it, version it, point a different agent at it, and take it with you, it is not really yours.

This matters more as models get closer to feature parity. As they converge, the advantage moves up a layer, into your accumulated workflow, your verification loops, your linked decisions, and your team memory.

Keep your harness portable across coding agents

Model competition is healthy, but you benefit from it only if your harness is portable.

When a new coding agent arrives, your team should be able to point a session at it without rebuilding the workflow above it. Claude Code today, Codex today, something else tomorrow, with the same files, same rules, same tools, and same graph underneath.

If switching the underlying agent means rebuilding your harness, then you do not really have optionality.
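
One way to keep that optionality is to make the task bundle harness-owned and the agent-specific part a thin adapter. The sketch below is an assumption about how such a boundary could look: the Task fields, the adapter classes, and the invocation dictionary are hypothetical, and the adapters deliberately stop short of calling any real CLI. The only agent-specific detail used is the root instruction file each agent reads (CLAUDE.md or AGENTS.md).

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Task:
    """Harness-owned description of the work; nothing here names an agent."""

    prompt: str
    context_files: tuple
    allowed_tools: tuple


class AgentAdapter(Protocol):
    name: str
    instruction_file: str


class ClaudeCodeAdapter:
    name = "claude-code"
    instruction_file = "CLAUDE.md"


class CodexAdapter:
    name = "codex"
    instruction_file = "AGENTS.md"


def build_invocation(task: Task, adapter: AgentAdapter) -> dict:
    """Combine the shared task with the few details that differ per agent."""
    return {
        "agent": adapter.name,
        "instructions": adapter.instruction_file,
        "prompt": task.prompt,
        "files": list(task.context_files),
        "tools": list(task.allowed_tools),
    }


# Illustrative task; the file and tool names are placeholders.
task = Task(
    prompt="Fix the sync bug described in the linked tracker item.",
    context_files=("plans/sync-fix.md", "rules/react.md"),
    allowed_tools=("edit_file", "run_tests"),
)

for adapter in (ClaudeCodeAdapter(), CodexAdapter()):
    print(build_invocation(task, adapter))
```

If adding a new agent means writing one more small adapter class and nothing else, the harness is portable in the sense described above.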

Nimbalyst can be one starting point for your harness

Nimbalyst is the open source workspace we use to assemble our own harness across these seven parts. It lets us run multiple coding agents side by side, so we can point a task at Claude Code, Codex, or whatever lands next without rebuilding the layer above them. The visual layer, the context graph, the empowerment tools, the file-based instructions, and the session model are all visible and inspectable.

Use Nimbalyst directly, or inspect it and learn from what we have done there.