Part: Enforcement

Chapter 11

Harness Engineering: The Model Is Not the Product

The same model. The same prompt. Two different harnesses. One scored 78% accuracy. The other scored 42%.1

I wish I could tell you this was a hypothetical. It isn't. Run the same language model through two different execution environments — one designed for the task, one generic — and the model's intelligence doesn't change. Its context doesn't change. Only the infrastructure around it changes. And the results swing by thirty-six percentage points.

The model is the engine. The harness is the car. You can put a Formula 1 engine in a shopping cart. It will not win races.


What a Harness Is

A harness is the deterministic infrastructure that surrounds an AI agent — everything the agent doesn't decide. The execution sequence. The validation checkpoints. The file system structure. The tool access permissions. The context management strategy. The human touchpoints. The deployment pipeline.

I spent months treating the harness as an afterthought before I realized it was the whole stack. In Dark Factory, the harness is codified as twelve engineering principles. Not suggestions — the non-negotiable infrastructure that makes the pipeline reliable.

1. Sequential state machine. The eight-phase pipeline is a state machine with gated transitions. You cannot enter BUILD without passing the SPEC gate. You cannot deploy without passing CERTIFY. The agent operates within phases; the harness governs transitions between them.

2. Fixed plan, dynamic execution. The pipeline sequence is fixed — IDEA → DISCOVER → SPEC → BUILD → TEST → CERTIFY → DEPLOY → MAINTAIN. Always. Within each phase, the agent has creative freedom. The harness constrains the what (which phases, in which order), not the how (how the agent solves the problem within a phase).

3. Virtual file system. A .factory/ directory holds all artifacts — spec, discovery document, test results, certification reports, deployment checklists. Agents write to a workspace scratch pad. Artifacts persist between phases. This means every phase starts with the full history of what came before, and no agent can accidentally overwrite another phase's output.

4. Sub-agent delegation. Complex tasks are decomposed into sub-agents with isolated context. The orchestrator stays lean — it routes work and collects results. Sub-agents get fresh context per task, preventing context rot from long-running conversations.

5. Tool guardrails. Agents have access to phase-appropriate tools only. A BUILD agent has file system access and CI triggers. It does not have production database access. It does not have internet access in sandbox mode. The tools available define the boundaries of what the agent can do — and more importantly, what it can't.

6. Memory architecture. Three levels: short-term (the .factory/ directory within a session), long-term (project cards in the vault), and cross-project (conventions and patterns). The agent doesn't start every session from scratch, and it doesn't carry stale context from three projects ago.

7. Gated transitions. Every phase boundary is a gate. The gate has specific, verifiable requirements. Not "does this look good?" but "are all eight spec sections present and non-empty?" Not "is the code probably correct?" but "does it build, lint, and type-check?" Verifiable gates remove subjective judgment from transitions.

8. Execution environment parity. Agents use the same tooling humans would. Same IDE, same linters, same test runners, same git workflow. This means a human can pick up where an agent left off — and vice versa — without translating between environments.

9. Context management. Summaries to the orchestrator; full context to sub-agents. Save heavy outputs to files, not to conversation memory. Capped iteration prevents the context window from filling with failed attempts. Fresh context per task prevents cross-contamination.

10. Human-in-the-loop scaling. Touchpoints scale with the trust tier. Tier 1: human approves deployment. Tier 4: human reviews spec, intent contract, test results, and every consequential decision. The harness enforces these touchpoints — they can't be skipped.

11. Validation loops. Automated validation runs before every gate. Lint after every file. Type-check before every commit. Scenario execution after every build. The validation is automated, fast, and non-negotiable.

12. Skills on rails. Skills provide the domain intelligence — the prompts, the specialized knowledge, the creative problem-solving. The harness provides the structure — the execution sequence, the validation checkpoints, the tool boundaries. Intelligence on the rails. Neither works without the other.
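Several of these principles are mechanical enough to sketch directly. Here is a minimal TypeScript sketch of principles 1 and 2, using hypothetical names (Phase, gates, advance) rather than Dark Factory's actual implementation: the phase sequence is fixed data, and the only way to move between phases is through a deterministic gate predicate.

```typescript
// The eight-phase pipeline as a fixed sequence (principle 2).
type Phase =
  | "IDEA" | "DISCOVER" | "SPEC" | "BUILD"
  | "TEST" | "CERTIFY" | "DEPLOY" | "MAINTAIN";

const PIPELINE: Phase[] = [
  "IDEA", "DISCOVER", "SPEC", "BUILD",
  "TEST", "CERTIFY", "DEPLOY", "MAINTAIN",
];

// A gate is a deterministic predicate over artifacts, not a judgment call.
type Gate = (artifacts: Map<string, string>) => boolean;

// Illustrative gates only: leaving SPEC requires a non-empty spec.md;
// leaving CERTIFY requires a certification report marked PASS.
const gates: Partial<Record<Phase, Gate>> = {
  SPEC: (a) => (a.get("spec.md") ?? "").trim().length > 0,
  CERTIFY: (a) => (a.get("certification.md") ?? "").includes("PASS"),
};

// The harness governs transitions (principle 1): you cannot leave a
// phase whose gate fails, and you cannot skip ahead in the sequence.
function advance(current: Phase, artifacts: Map<string, string>): Phase {
  const gate = gates[current];
  if (gate && !gate(artifacts)) {
    throw new Error(`Gate failed: cannot leave ${current}`);
  }
  const i = PIPELINE.indexOf(current);
  return i < PIPELINE.length - 1 ? PIPELINE[i + 1] : current; // MAINTAIN is terminal
}
```

The agent never calls advance; the orchestrating harness does. That is the division of labor: the agent works within a phase, and the transition logic is ordinary deterministic code.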


The Principles in Practice

Twelve principles are a lot to hold in the abstract. Here is what several of them look like when they hit the code.

Principle 3 — Virtual file system. The .factory/ directory doesn't just hold the spec. It enforces an information architecture: agents write to the workspace, structured outputs go to artifacts, and nothing persists to the main codebase until a human approves the transition. This is what it looks like when a build agent is completing a session:

.factory/
├── CLAUDE.md                    # Standing orders — read every session
├── spec.md                      # Behavioral contract
├── session-log.md               # Updated at the end of each session
├── test-results/
│   └── 2026-04-03.md           # Scenario execution results
└── scratch/
    └── migration-draft.sql      # Working file — not committed until reviewed

The scratch/ directory is temporary workspace. The agent uses it for in-progress files — draft migrations, intermediate outputs, working notes. Nothing in scratch/ is part of the codebase. When the session ends and the work is reviewed, the relevant artifacts move to their permanent locations. This separation prevents the most common brownfield contamination: working files that get committed accidentally because they were in the project root.

Principle 5 — Tool guardrails. Phase-appropriate tool access is not a configuration that most developers think about — it's implicit in how you structure the session. Here's the difference:

Without tool guardrails, a BUILD agent has access to everything: production database credentials, deployment triggers, external APIs. The agent deciding when to run a migration or trigger a deployment is an agent making operational decisions it shouldn't make.

With tool guardrails, the BUILD agent has exactly: read/write access to the project file system, pnpm commands, and git (branch operations only — no merge, no push to main). Production database access lives in the DEPLOY phase. Deployment triggers are not available until CERTIFY passes. The tools define the blast radius — what the agent can affect if it makes a mistake.

In Claude Code specifically, this is implemented through the permissions block (the allow and deny arrays) in the project's settings.json:

{
  "permissions": {
    "allow": [
      "Bash(pnpm:*)",
      "Bash(git checkout:*)",
      "Bash(git add:*)",
      "Bash(git commit:*)",
      "Read",
      "Write",
      "Edit"
    ],
    "deny": [
      "Bash(git push:*)",
      "Bash(git merge:*)",
      "WebFetch"
    ]
  }
}

The agent can build. It cannot deploy. It cannot access the internet (no WebFetch in sandbox mode). The harness enforces this at the execution layer, not at the prompt layer — the agent doesn't need to decide whether to push; the tool isn't available.

Principle 7 — Gated transitions. Every phase boundary has a specific, verifiable requirement. Not "does this seem done?" but a checklist that can be audited:

## BUILD → TEST Gate

Required before running scenario suite:
- [ ] `pnpm lint` exits 0 (zero warnings)
- [ ] `pnpm typecheck` exits 0
- [ ] All new functions have types (no `any` except documented exceptions)
- [ ] No TODO comments in changed files
- [ ] Migration applied and `pnpm db:migrate` exits 0
- [ ] Hard boundary constraints reviewed against new code
- [ ] Session log updated

CERTIFY gate requires: scenario suite pass, stress test pass at tier threshold
DEPLOY gate requires: CERTIFY signed off by spec architect

The gate is binary: all items checked, or the transition doesn't happen. There is no "mostly done" state. This prevents the most common pipeline failure: a feature that passes casual inspection, gets merged, and breaks something in production because one checklist item was skipped.
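A gate like this reduces to a small, boring function. This is a sketch with hypothetical names (GateItem, evaluateGate); in practice each item would wrap a real command and record whether it exited 0.

```typescript
// One checklist item: a named check that either passed or didn't.
interface GateItem {
  name: string;
  passed: boolean;
}

interface GateResult {
  passed: boolean;
  failures: string[]; // unmet items, recorded for the session log
}

// Binary by construction: every item passes, or the transition doesn't happen.
function evaluateGate(items: GateItem[]): GateResult {
  const failures = items.filter((item) => !item.passed).map((item) => item.name);
  return { passed: failures.length === 0, failures };
}
```

There is deliberately no score, percentage, or "mostly done" field in GateResult. The absence is the point: the data structure cannot represent a partially passed gate.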

Principle 9 — Context management. The context management principle solves two problems simultaneously. First: context windows fill up. A conversation that has been running for two hours has accumulated everything — early exploration, wrong turns, intermediate drafts, back-and-forth on decisions that were ultimately reversed. An agent trying to reason about the current state of the code while navigating a context full of "actually, let me try a different approach" is an agent working against itself.

Second: sub-agents need fresh context to do specialized work well. A sub-agent given the task "review this migration for correctness" performs better when its context contains exactly the migration, the schema, and the relevant spec section — not the entire conversation that led to writing the migration.

The implementation:

## Context Management Rules

Orchestrator session:
- Never accumulate more than 3 working files in conversation
- After each file completion: save to .factory/, summarize to 2 sentences, continue
- If context feels crowded: summarize the session so far to session-log.md, start fresh session

Sub-agent invocations:
- Provide: CLAUDE.md + relevant spec section + files to be changed
- Do NOT provide: full conversation history, unrelated files, scratch notes
- Output: structured result only (no "here's what I did" — just the artifact)

The two-sentence summarization rule sounds trivial. In practice, it forces a decision: what is actually important about what was just built? The parts that survive the two-sentence compression are the parts that will matter to the next session.
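The sub-agent rules above are an assembly problem, and assembly can be code. A sketch with hypothetical names (buildSubAgentContext and its parameters): the orchestrator hands over standing orders, one spec section, and the target files, and structurally cannot hand over the conversation history because the function has no parameter for it.

```typescript
// What a sub-agent receives: nothing more than the rules above allow.
interface SubAgentContext {
  standingOrders: string;           // CLAUDE.md contents
  specSection: string;              // only the relevant spec section
  files: Record<string, string>;    // only the files to be changed
}

function buildSubAgentContext(
  claudeMd: string,
  spec: Record<string, string>,       // spec section name -> section text
  sectionName: string,
  workspace: Record<string, string>,  // file path -> file contents
  targetPaths: string[],
): SubAgentContext {
  const specSection = spec[sectionName];
  if (specSection === undefined) {
    throw new Error(`Unknown spec section: ${sectionName}`);
  }
  const files: Record<string, string> = {};
  for (const path of targetPaths) {
    if (path in workspace) files[path] = workspace[path];
  }
  // Deliberately absent: conversation history, scratch notes, unrelated files.
  return { standingOrders: claudeMd, specSection, files };
}
```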


The Harness in Practice

The cleanest example I have of this architecture in the wild is Attacca Claw Desktop — a brownfield Electron app I inherited mid-build. Discovery produced new specs before the next round of fixes, and the artifact that survived discovery was the CLAUDE.md file. Not the code. Not the issue tracker. The standing-orders document the agent reads at the top of every session.

A harness config carries three layers of constraint. First, hard boundaries that encode the stakes of the product:

- User API keys must NEVER leave the local machine.
- All Anthropic API calls originate from the main process (Node.js).
- The renderer never holds the API key in memory.

I realized, reading this back, that those three lines are doing the work of an entire threat model. The agent doesn't have to reason about why the renderer shouldn't hold credentials. The rule is stated; the loophole is closed.

Second, architectural invariants — the parts of the stack the agent is not allowed to touch:

- Permission engine (`permission-gate.ts`) is untouched. Risk classification
  by verb (GET/LIST → low, CREATE/UPDATE → medium, SEND/DELETE → high) is
  the moat. Do not refactor into a generic policy engine.
- Tool calls route: agent → tool-executor.ts → permission gate → Composio
  SDK → REST fallback. Do not short-circuit the gate.

Third, delegation patterns that tell the agent where to decompose and where to stay narrow — which sub-agent owns the memory layer, which owns IPC, which owns the renderer. Not a list of commands. A map of ownership.

What these rules encode is not style. They encode where the liability sits and which layers are the moat. They are text — not code — because the agent enforces them by reading them, the way a new engineer enforces them by reading the onboarding doc. The harness file is the durable artifact. Sessions end; the config persists. The next agent that opens the repo starts from the same standing orders the last one did. That is the architecture of harness configuration: a human-readable contract, machine-enforced via agent attention, rewritten only when the domain shifts.


Two Harnesses, Two Philosophies

I work with two harnesses daily. Not competing tools — two fundamentally different philosophies of agent development.

Claude Code is a collaborator. It runs locally, on your machine, with access to your file system and your Unix primitives. It explores the codebase, reads files, makes changes, runs tests — all in conversation with you. You see what it does. You course-correct in real time. It's designed for human-in-the-loop development: greenfield projects, complex specifications, situations where the agent needs your judgment at every turn.

OpenAI's Codex is a contractor. It runs in an isolated container with its own development tools. You give it a task. It works autonomously — reading the repo, writing code, running tests — and delivers a result. You don't see the work in progress. You see the pull request. It's designed for autonomous execution: brownfield changes with clear specifications, well-tested codebases, situations where the task is well-defined and the agent doesn't need real-time guidance.

Dark Factory uses both. Claude Code for greenfield, where the spec is being written in conversation with the agent. Codex for brownfield, where the spec is clear and the discovery document maps the terrain. The choice isn't about which model is better — it's about which harness philosophy matches the task.

The Hernan story lives right here. He chose Cursor over Claude Code because Cursor shows diffs, asks for approval, and lets him see the code. His instinct was to add visibility — to put a harness around the agent's output that matched his working style. That's correct harness thinking. The specific tools he chose were Cursor's UX features, which he understood intuitively as trust infrastructure. What he also needed was context infrastructure — the CLAUDE.md, the spec reference, the hard boundaries — so that the agent's creative work started from the right place rather than merely being reviewed at the right time.

The roller coaster he described came from reviewing output without knowing the input. He saw the diffs (output). He didn't have a systematic way to compare them against the spec's intent (input). A harness provides both: a visible record of what the agent was told to do, and a structured way to verify that it did it. The diff Hernan was reviewing was the answer. What he was missing was the question.


Human Touchpoints in Practice

Principle 10 — human-in-the-loop scaling — is the most important principle for Tier 3 and Tier 4 systems, and the one most often treated as a formality. The harness doesn't just recommend human checkpoints. It enforces them.

What enforcement looks like depends on the tier.

Tier 1 — Assisted: Human approves deployment. One gate. The agent builds, tests, and produces a deployment artifact. A human reviews the deployment summary (what changed, what was tested, what the rollback plan is) and presses the button. The human is not reviewing the code; they're approving the operational act.

Tier 2 — Constrained: Human reviews the certification report before deployment. The certification report includes: scenario suite results, any stress test variations that were applied, and a one-paragraph summary of what was built. The review takes five minutes for a feature that passed cleanly. It takes longer if anything is flagged. The gate is the review, not just the approval.

Tier 3 — Supervised: Human reviews spec changes, intent contract changes, and the full test results before deployment. A domain expert (not the developer) reviews the scenario suite for representativeness — "do these test cases reflect how real users actually use this?" This is the tier where the evaluators and the builders are different people.

Tier 4 — Controlled: Human is present at every significant decision point. Not just the final review — the spec walkthrough, the intent contract definition, the scenario library review, the stress test threshold setting. For the BuildingManagementOS compliance module, this means the administrador's legal advisor reviews the hard boundaries before the code is written. It means a compliance officer signs off on the scenario library. It means no code ships without an explicit human attestation that the behavior is correct under Colombian law.

The harness enforces these touchpoints structurally — not through reminders, but through gates. The BUILD agent doesn't have production deployment access. The CERTIFY gate requires a signed-off report. The DEPLOY trigger requires human input that can't be simulated by the agent. Each touchpoint is a tool restriction, not a process request.
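Structural enforcement can be made concrete. A hedged sketch with hypothetical names (HumanAttestation, deploy): the deploy path simply has no way to proceed without a value the agent cannot manufacture, because the attestation is produced outside the agent's toolset.

```typescript
// A record only a human workflow can produce (e.g. a signed approval form).
interface HumanAttestation {
  approver: string;         // who signed off
  certifyReportId: string;  // which CERTIFY report they reviewed
  signedAt: string;         // ISO-8601 timestamp
}

// The touchpoint as a tool restriction: no attestation, no deployment.
function deploy(
  certifyPassed: boolean,
  attestation: HumanAttestation | null,
): string {
  if (!certifyPassed) throw new Error("CERTIFY gate not passed");
  if (attestation === null) throw new Error("Human attestation required");
  return `deploy approved by ${attestation.approver}`;
}
```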

The temptation under time pressure is to skip touchpoints. "I'll add the legal review later." "This is just a small change, the certification is overkill." These are the choices that produce liability — in the literal legal sense for Tier 4 systems, and in the operational sense for all of them. The harness is not strict about touchpoints because strictness is satisfying. It is strict because the touchpoints are the moments where accumulated agent decisions get a human sanity check, and skipping them is how subtle errors compound into significant failures.


Intelligence on Rails: The Skills Layer

Principle 12 — skills on rails — is the architectural principle that makes the harness approach sustainable at scale. It deserves more than a sentence.

A skill is a unit of agent capability that can be developed, tested, and deployed independently of any specific project. It has four components: a prompt optimized for a specific task, a tool set (the tools the skill is allowed to call), an output format (what the skill produces), and an eval suite (the scenarios and stress variations used to test it).

Here is what a complete skill definition looks like:

## Skill: competitor-analysis
Version: 1.2.0 | Domain: Research | Tier: 1–2

### Prompt
You are a competitive intelligence analyst. Given a brand name and market segment,
produce a structured competitive analysis covering the top 3–5 direct competitors.

For each competitor, extract:
- Positioning statement (1 sentence)
- Primary differentiator vs. the target brand
- Estimated market presence (traffic tier: high / medium / low)
- Key weaknesses or gaps

Follow the Output Format exactly.
If a competitor cannot be found, return NOT_FOUND — do not guess.

### Tool Set
Allowed:
  WebSearch       (max 5 calls per competitor)
  WebFetch        (public sites, Crunchbase, LinkedIn company pages only)

Denied:
  WebFetch to social media platforms
  WebFetch to any authenticated endpoint
  Any internal database access

### Output Format
{
  "target_brand": string,
  "market_segment": string,
  "analysis_date": ISO-8601,
  "competitors": [
    {
      "name": string,
      "positioning": string,
      "differentiator": string,
      "market_presence": "high" | "medium" | "low",
      "gaps": [string]
    }
  ],
  "confidence": "high" | "medium" | "low",
  "coverage_notes": string
}

### Eval Suite
Base scenarios (regression — must pass before any version update):
  1. Well-documented brand in a competitive market → expect ≥4 competitors, all fields populated
  2. Niche brand with <3 findable competitors → expect partial results, NOT_FOUND for missing
  3. Brand name collision (two companies, same name) → must disambiguate by market segment

Stress variations:
  - Non-English language market → search in market language, output in English
  - Brand with no public competitor data → confidence = low, coverage_notes explains gap
  - Market segment with no clear category leader → must not fabricate a leader

Pass threshold (Tier 2): ≥90% field accuracy on base scenarios, zero hallucinated competitors

A skill file is the unit of truth for a capability. When the competitor-analysis skill runs in the VZYN Labs platform, it runs this prompt, with these tools, producing this format, and has passed these scenarios. The platform doesn't redefine any of it — it just calls the skill.

Skills plug into the harness the way libraries plug into a codebase. The harness provides the execution environment; the skill provides the capability. The competitor-analysis skill doesn't know or care whether it's running in the VZYN Labs marketing platform or the Regasificadora del Pacífico market intelligence module. It takes inputs, calls tools, and produces a formatted output. The harness routes the right inputs to it and handles the output.
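One way to make that separation concrete, as a sketch under assumed names (Skill, invoke) rather than Dark Factory's actual interfaces: the skill declares its own tool guardrails, and the harness wraps the tool dispatcher so the skill can only reach what its definition allows.

```typescript
// A skill: prompt logic, tool set, and output format travel together.
interface Skill<I, O> {
  name: string;
  version: string;
  allowedTools: string[];  // the skill's tool guardrails
  run(input: I, callTool: (tool: string, arg: string) => string): O;
}

// The harness side: route inputs in, enforce tool boundaries, collect output.
function invoke<I, O>(
  skill: Skill<I, O>,
  input: I,
  tools: Record<string, (arg: string) => string>,
): O {
  const guardedCall = (tool: string, arg: string): string => {
    if (!skill.allowedTools.includes(tool)) {
      throw new Error(`${skill.name} may not call ${tool}`);
    }
    const fn = tools[tool];
    if (fn === undefined) throw new Error(`Unknown tool: ${tool}`);
    return fn(arg);
  };
  return skill.run(input, guardedCall);
}
```

The skill never sees the raw tool table, so enforcement happens at the execution layer, and the same skill runs unchanged in any harness that can satisfy its tool set.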

This separation produces two benefits that compound over time.

Skills can be reused across projects. The document-extraction skill developed for the Declara IA tax platform — which parses PDF tables, resolves ambiguous field mappings, and produces clean JSON output — is the same skill used in the Regasificadora del Pacífico manual digitization project. It was tested once against a real Formulario 220 PDF, the test cases were added to the eval library, and it now runs reliably in both contexts. The total cost of the skill is amortized across every project that uses it.

Skills can be improved without touching the projects that use them. When Gemini's PDF extraction improved in a model update, the document-extraction skill was updated and tested. The two projects that used it got the improvement on their next deployment, without any code changes on their end. The model change protocol (from Chapter 9) ran against the skill's eval suite, confirmed the improvement was genuine and didn't introduce regressions, and the update shipped.

The Dark Factory skill catalog has fifty-seven skills across eleven domains, built and validated during the VZYN Labs project. Across all of Nirbound's subsequent projects — Declara IA, BuildingManagementOS, Regasificadora del Pacífico — roughly thirty of those skills have been reused without modification. The time savings are not from having built the skills. They're from having tested them: a tested skill is a proven capability, and proven capabilities are what allow the methodology to ship reliable products faster than traditional development at the same quality level.

The alternative — building capability inline, embedded in agent prompts written during a build session — produces systems where the intelligence and the infrastructure are tangled together. Changing the prompt risks breaking the system. Reusing the capability means copying the prompt and hoping the context matches. Testing means testing the entire system rather than the discrete capability.

Skills on rails is the answer to the question: "How do you maintain quality when you're building faster than traditional development?" You maintain quality at the capability level, before the capabilities are assembled into products. The harness is the assembly environment. The skills are the components. Good components in a good environment produce reliable products.


Harness Anti-Patterns

Most teams that build AI agent systems without a formal harness aren't building nothing. They're building informal harnesses — ad hoc collections of conventions, process rituals, and shared understandings that serve the same function as the twelve principles but do so unreliably, invisibly, and in ways that don't transfer between projects or team members.

Recognizing the anti-patterns is the first step to replacing them with something that works.

The "Prompt as Harness" anti-pattern. The team encodes all constraints into the system prompt. "Always use TypeScript. Never modify production tables directly. Check for existing functions before writing new ones. Format all dates as ISO 8601." The prompt grows. It reaches 3,000 words. Nobody is sure what's actually in it. The agent follows most of it most of the time. The constraints that get skipped are the ones at the end, after the model's attention has diluted across the length of the instruction.

Prompts are not harnesses. Prompts are input to a probabilistic system. A constraint in a prompt will be followed probabilistically. A constraint in a permission configuration, a linting rule, or a gate checklist will be enforced deterministically. The distinction is reliability: the harness enforces constraints; the prompt requests compliance.
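The difference is easy to see in code. Below is a deliberately simplified sketch of deterministic enforcement; the real pattern-matching rules belong to Claude Code, and this hypothetical matcher handles only the Bash(prefix:*) shape from the settings.json shown earlier.

```typescript
// Match a permission pattern like "Bash(git push:*)" against a shell command.
// Simplification: only the Bash(prefix:*) shape and exact names are handled.
function matches(pattern: string, command: string): boolean {
  const bash = pattern.match(/^Bash\((.+):\*\)$/);
  if (bash !== null) return command.startsWith(bash[1]);
  return pattern === command;
}

// Deterministic: deny wins, then the command must match an allow entry.
function isAllowed(command: string, allow: string[], deny: string[]): boolean {
  if (deny.some((p) => matches(p, command))) return false;
  return allow.some((p) => matches(p, command));
}
```

A prompt saying "never push to main" is followed probabilistically; with a deny list containing "Bash(git push:*)", isAllowed returns false for a push command every time it is called. That repeatability is what "enforced deterministically" means here.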

The "Ritual Checkpoint" anti-pattern. The team adds a step to their process: "before shipping, ask the agent to review the code against the spec." This feels rigorous. The agent produces a thoughtful review. Developers feel confident. The confident wrong answers ship anyway, because an agent reviewing agent output with the same underlying model will tend toward the same conclusions.

Rituals substitute process for enforcement. They create the sensation of a gate without the structure of a gate. A real gate has verifiable requirements — items that can be checked true or false — and stops the pipeline until they're met. A ritual gate is a step in the process that people follow when they remember to and skip when they're busy.

The "Memory Through Context" anti-pattern. Rather than maintaining a session log and project artifacts, the team uses very long conversations. They keep the same chat window open for days, referencing earlier parts of the conversation as though the model had genuine memory. "Remember when we decided to use the trigger-based compliance log?" The model may produce a coherent response that suggests it remembers. Whether it actually consulted that part of its context is unknowable.

Long contexts degrade. The early parts of a very long conversation receive less attention than the recent parts. Decisions made ten hours ago in the same chat window are effectively lost — not because the model can't access them, but because the model's attention is unevenly distributed across the context window. Explicit session logs written to structured files, read at the beginning of each session, are more reliable than hoping the model navigates a long conversation correctly.

The "Organic Harness" anti-pattern. Senior team members know the implicit conventions — they've internalized the project's unwritten rules through months of experience. New team members and new agents are repeatedly corrected when they violate these conventions. The conventions are never written down because "everyone knows them."

Organic harnesses fail catastrophically when the senior team member is absent. They fail gradually when an AI agent joins the team, because the agent has no mechanism for absorbing implicit knowledge. Every implicit convention that exists in a person's head but not in a CLAUDE.md is a rule that the agent will violate at the worst possible moment.

The twelve principles of harness engineering are not a replacement for good engineering judgment. They are a structure for making that judgment explicit, durable, and transferable — to team members, to agent sessions, and to future projects.


Harness Debt

Technical debt is the accumulated cost of decisions made for short-term convenience that reduce the quality of the system over time. Harness debt is its counterpart — the accumulated cost of building without a harness, measured in the overhead that has to be managed manually because the harness doesn't manage it automatically.

It is possible to build AI agent systems without a harness. Many teams do. They write prompts inline, run agents without explicit permission restrictions, skip the phase gates, and make deployment decisions through conversation rather than structured checkpoints. The early sessions feel fast — no setup time, no CLAUDE.md to write, no spec format to fill out. The agent produces output quickly and you ship it quickly.

The debt accumulates in what you don't have: no session log means every session starts cold. No hard boundaries means the agent occasionally makes decisions that have to be manually reversed. No phase gates means you have no systematic way to know when you're done with one phase and starting the next. No eval library means quality assurance is "did I test it manually before shipping?"

The cost of each missing piece is small per session. Across a project, it's significant. Across multiple projects, it becomes the primary constraint on how fast you can move. You're slower because you're compensating manually for infrastructure that doesn't exist.

The harness is setup time that pays back in every subsequent session. The CLAUDE.md takes an hour to write well and saves twenty minutes of context re-establishment per session. An eval library takes a week to build from scratch and pays for itself with every production incident it catches before deployment. The phase gates feel like process overhead until the first time they catch something that would have shipped.

Harness debt, unlike technical debt, doesn't surface gradually. It shows up as a step function: the project works fine in early sessions, then reaches a complexity threshold where the manual compensations break down, and suddenly nothing is working right. The VZYN Labs timeline is the canonical example — two months of sessions that felt productive, followed by an architect's review that revealed the codebase was too complex to maintain. The harness debt wasn't visible until the threshold was crossed.

The remedy is incremental: add the harness before you need it, not after you feel the pain.


When the Harness Is Absent

The VZYN Labs story is covered elsewhere in this book. But it's worth examining through the harness lens specifically. Not a model failure. Not a spec failure. A harness failure.

The engineers built a hexagonal architecture — a pattern that makes explicit everything the Dark Factory harness makes implicit. Ports and adapters encode the interface contracts that CLAUDE.md's hard boundaries encode. Domain objects enforce the separation that .factory/ enforces through directory structure. Quality gates in the pipeline enforce what the gated transitions principle encodes as a checklist.

The difference: hexagonal architecture codifies these constraints in the language of the codebase. The harness codifies them in the language of the agent. An agent navigating hexagonal architecture has to understand the abstraction to operate correctly within it. An agent working with a well-configured CLAUDE.md and a .factory/ directory follows the constraints without needing to understand the theory behind them.

The engineers weren't wrong to want those constraints. They were wrong about where to encode them. The right place for constraints on what an agent can and can't do is the harness — the CLAUDE.md, the permission configuration, the gated transition checklist. Not the architecture, which the agent has to navigate around.

This is the insight that makes the model-harness distinction more than a metaphor. A Formula 1 engine in a shopping cart fails not because the engine is inadequate for the terrain. It fails because the vehicle's constraints — the architecture — weren't designed for the engine's capabilities. Dark Factory puts the constraints where the agent can read them directly: not embedded in the codebase structure, but written as explicit standing orders.


The Bitter Lesson for Harnesses

In 2019, Richard Sutton published what he called "The Bitter Lesson" about the history of AI research. The thesis: researchers who build domain-specific clever systems — systems that encode human knowledge about how to solve the problem — are consistently outperformed, over time, by researchers who build general systems that can learn from scale. The bitter part: human cleverness in the system design doesn't help as much as we want it to. Raw computation and data win.

There is a version of this lesson that applies to harness design. As models get stronger — more capable of navigating complex environments, more reliable in their reasoning, better at inferring intent from ambiguous instructions — simpler harnesses win.

The CLAUDE.md for a project in 2023 needed to be very explicit about things a 2026 model handles naturally. "Always use TypeScript, never any." "Prefer server components to client components." "Check that routes are defined before implementing handlers." These were necessary guardrails when models would drift into anti-patterns without explicit instruction. They are less necessary now — not because the discipline changed, but because the model's defaults improved.

The practical implication: audit your CLAUDE.md every few months. What constraints are you encoding that the model now handles correctly without instruction? Remove them. A harness that grows but never shrinks accumulates instructions that are either redundant (the model already does this) or contradictory (the model's improved behavior conflicts with an instruction written for its older behavior). The harness is not a permanent document. It is a living contract that should get simpler over time, not more complex.

This doesn't mean the harness disappears. Hard boundaries don't disappear — the model getting smarter doesn't change the compliance requirements for a Colombian building governance system or the safety requirements for a prescription medication referral flow. Those constraints stay. What shrinks is the operational scaffolding: the detailed instructions about tool usage, naming conventions, and code patterns that exist to compensate for model limitations rather than to encode domain requirements.

The model is not the product. But the better the model, the less the harness needs to explain. The harness that ages well is the one that encodes the invariants — the constraints that are true regardless of model capability — and releases the compensations — the instructions that were needed because the model wasn't good enough to infer them. Knowing the difference is what makes a harness engineer rather than a harness accumulator.

The practical test: for each item in your CLAUDE.md, ask "is this true because of the domain, or because of the model?" Domain truths stay. Model compensations get reviewed every six months. A compliance constraint for a Colombian building governance system is domain truth — it remains until Colombian law changes. A constraint telling the model to "always define types before writing function bodies" is a model compensation — it may become unnecessary as models improve at inferring type requirements from context.

This audit practice produces something surprising: as the harness simplifies, it becomes more legible. A CLAUDE.md with twenty domain constraints is easier to maintain, easier to onboard to, and easier to reason about than one with sixty instructions of mixed origin. The invariants are the stable ones. Build on those. Let the compensations go when the model no longer needs them.

The harness is infrastructure, not prescription. The best infrastructure is invisible — it enforces the right constraints without being noticed, and it shrinks over time as the systems it constrains become more capable. That's the goal: a harness that does exactly what's necessary, no more, and earns the right to do less as the intelligence it contains grows.


The next chapter goes deeper into the intent layer — the organizational alignment infrastructure that tells agents not just what to do, but why.


Footnotes

  1. Harness comparison study. [SOURCE — identify the specific paper or benchmark that produced the 78%/42% split; confirm model, task, and harness configurations used]