Part: The Framework

Chapter 8

Build: Execution on Deterministic Rails

Hernan described his week as a "montaña rusa" — a roller coaster. He'd been working on Edifica, implementing features with Cursor and Claude Code. Some sessions produced exactly what he needed. Others produced what he called "crazy things" — changes that didn't make sense, that would have damaged the project if he hadn't caught them.

His response was natural: retreat to more control. He chose Cursor over Claude Code because Cursor shows diffs, asks for approval, lets him see the code. "My background is code," he explained, "so seeing the code opens more possibilities for me." Claude Code's terminal-first, trust-the-agent approach required a leap he hadn't made yet.

I used to think the same way Hernan did — that the fix was a better model or a tighter prompt. I was wrong. Hernan's instinct was right, but not for the reason he thought. Not the model — the harness. The difference between an agent that ships reliable output and an agent that ships "crazy things" isn't intelligence. It's the deterministic layer around the intelligence.


The Agent Does the Creative Work. The Harness Does Everything Else.

The build phase has a simple principle: every step that must happen every time is codified in software, not trusted to the model.

The agent's job is creative — solving problems, choosing patterns, writing code. That's what language models are good at. But the steps around the creative work — git operations, formatting, linting, type-checking, dependency management, branch isolation — are deterministic. They should happen the same way, every time, regardless of what the agent decides to build.

This is the harness. The same model scores 78% accuracy in one harness and 42% in another. Not the model — the stack around it. The model is the engine. The harness is the car. Most teams optimize the engine and ignore the car, then wonder why the ride is rough.


The .factory/ Directory

Every project in the Dark Factory methodology has a .factory/ directory at its root. This is the harness's home — the infrastructure that lives alongside the codebase but is not part of it.

The directory has a fixed structure:

.factory/
├── CLAUDE.md          # Agent configuration — the most important file
├── spec.md            # The behavioral specification (living document)
├── discovery.md       # For brownfield projects: codebase map + system model
├── intent.md          # Organizational intent contract
├── test-results/      # Latest scenario execution outputs
│   └── YYYY-MM-DD.md
├── eval-library/      # Behavioral scenarios + stress test variations
│   ├── base-scenarios.md
│   └── stress-variations.md
└── session-log.md     # What was worked on, decisions made, next actions

The .factory/ directory is not code. It is not documentation. It is infrastructure for the agent — the same way a package.json is infrastructure for a JavaScript project. Without it, the agent starts every session from scratch, makes decisions it shouldn't make, and produces inconsistent output. With it, every agent session begins from a known state.
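Because the directory is infrastructure, a session should fail fast when it is incomplete. A minimal sketch of that bootstrap check — the `exists` predicate is injected here for illustration; a real harness would pass `fs.existsSync`:

```typescript
// Required .factory/ entries, per the fixed structure above.
const REQUIRED = ["CLAUDE.md", "spec.md", "session-log.md", "eval-library"];

// Returns the missing entries; an empty array means the session may start.
// `exists` is an injected predicate so the check is testable without a filesystem.
function factoryMissing(exists: (path: string) => boolean): string[] {
  return REQUIRED.filter((f) => !exists(`.factory/${f}`));
}
```

A harness that runs this before hydration turns "the agent starts from scratch" from a silent failure into a visible one.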

When a developer picks up a project that has been dormant for three months, they read the README. When an agent picks up a project, it reads .factory/CLAUDE.md. Same purpose: establish context before work begins. The difference — and I learned this the hard way — is that the agent reads literally and ships on what it reads. So the configuration must be precise.


The Build Sequence

BUILD always follows the same sequence. No exceptions. No shortcuts.

Step 1: Pre-hydrate context. Before the agent writes a single line of code, feed it everything: the spec, the discovery document (if brownfield), relevant source files, and project conventions. The agent reads everything before it acts. An agent that starts coding from a partial understanding produces partial solutions.

Context hydration is not "paste the spec into the chat." It is a structured sequence: read .factory/CLAUDE.md first (project identity and hard constraints), then the spec section relevant to the current task, then the source files that will be touched, then any migration or schema files the change depends on. The order matters — constraints first, then requirements, then context. An agent that reads requirements before constraints will design solutions that violate the constraints and then have to be backed out.

For a project with a thirty-page spec, don't hydrate the whole document. Hydrate the section. A Tier 2 feature — adding a new report type to a dashboard — needs the report spec section, the relevant data model, and the existing report component. It does not need the asamblea governance module or the notification architecture. Over-hydration fills the context window with irrelevant material, and the agent will reference it anyway, sometimes inappropriately.
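The constraints-first ordering can be sketched as a small assembly function. This is an illustration of the sequencing rule, not the methodology's actual tooling; the labels and field names are invented for the example:

```typescript
// A hydration piece: hard constraints (CLAUDE.md), a spec section, or a source file.
type Piece = { label: string; kind: "constraints" | "spec" | "source"; text: string };

// Constraints first, then requirements, then context — the order the chapter mandates.
const RANK: Record<Piece["kind"], number> = { constraints: 0, spec: 1, source: 2 };

function buildPayload(pieces: Piece[]): string {
  return [...pieces]
    .sort((a, b) => RANK[a.kind] - RANK[b.kind]) // stable sort preserves order within a kind
    .map((p) => `--- ${p.label} ---\n${p.text}`)
    .join("\n\n");
}
```

However the pieces arrive, the payload the agent reads always leads with the hard constraints — which is the point.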

Step 2: Agent implements. The creative work. The agent reads the spec, designs the solution, writes the code. This is where the model's intelligence matters — its ability to understand requirements, choose appropriate patterns, and translate specification into implementation.

Step 3: Shift-left validation. After every file change — not after the feature is done, after every file — run lint and type-check. This must take less than five seconds. If validation catches an error immediately, the agent corrects it with full context. If validation catches it thirty minutes later, the agent has lost the context and the fix is expensive.

The five-second rule is not arbitrary. Language model context is sequential — the agent produces output in tokens, and by the time it has written three more files, the mental model for the first file is no longer the most recent thing in its window. An error caught immediately is caught while the agent still has the full reasoning for the decision that caused it. An error caught late requires reconstructing that reasoning from the output — which is slower, less accurate, and more likely to produce a fix that addresses the symptom rather than the cause.

Validation setup for a typical TypeScript project:

// package.json
{
  "scripts": {
    "lint": "eslint src --max-warnings 0",
    "typecheck": "tsc --noEmit",
    "validate": "pnpm lint && pnpm typecheck"
  }
}

The --max-warnings 0 flag is critical. It means warnings are errors. An agent working in an environment where warnings are tolerated will leave a trail of lint warnings that accumulates across sessions until the codebase is cluttered with ignored advisories. Zero warnings, enforced by the harness; zero exceptions.
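The five-second rule can itself be codified. A hedged sketch of a shift-left hook — the runner is injected for testability; a real hook would shell out to `pnpm validate`, and the names here are illustrative:

```typescript
// A command runner reports whether validation passed and how long it took.
type Runner = (cmd: string) => { ok: boolean; ms: number };

// Run validation after a file change and check it against the five-second budget.
function shiftLeftCheck(run: Runner, budgetMs = 5000): { pass: boolean; withinBudget: boolean } {
  const result = run("pnpm validate"); // lint + typecheck, zero warnings
  return { pass: result.ok, withinBudget: result.ms <= budgetMs };
}
```

A validation step that blows the budget is a harness defect in its own right: if the check takes thirty seconds, it will not be run after every file.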

Step 4: Deterministic guardrails. Git operations, formatting, dependency management — these are code, not agent decisions. The agent doesn't decide whether to commit, when to push, or how to format. The harness does. This eliminates an entire category of errors that have nothing to do with the agent's intelligence and everything to do with operational consistency.
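"The harness decides, not the agent" can be made literal: the agent reports state, and a deterministic policy function chooses the git action. A minimal sketch, with invented field names, of what that policy might look like:

```typescript
// State the agent reports at the end of a change; fields are illustrative.
type BuildState = { validated: boolean; onMain: boolean; filesChanged: number };

// The harness's commit policy — same inputs, same decision, every time.
function gitAction(s: BuildState): "commit" | "block" | "noop" {
  if (s.onMain) return "block";     // isolation: never commit directly to main
  if (!s.validated) return "block"; // never commit unvalidated changes
  if (s.filesChanged === 0) return "noop";
  return "commit";
}
```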

The most important deterministic guardrails are the ones that protect against catastrophic decisions. In the BuildingManagementOS project, the CLAUDE.md includes a section called "Hard Boundaries (Never Violate)":

## Hard Boundaries (Never Violate)

1. Never open an asamblea session without verified quorum
2. Never publish an acta without explicit human review/approval
3. Never show financial documents to owners before the full sign-off chain completes
4. Never delete or modify compliance log entries (immutable audit trail)
5. Never allow an owner to be both physically present AND represented by 
   poder in the same asamblea
6. Never expose stack traces or internal errors to users
7. Never send automated substantive replies via any external channel — 
   all replies must come from a human decision-maker

These are Tier 4 constraints — the system handles legal compliance for building governance under Colombian law, and a violation of any of these rules could expose an administrador to legal liability or invalidate a legal assembly. They live in CLAUDE.md, not in the spec, because they are not behavioral requirements — they are non-negotiable boundaries that the agent must refuse to cross regardless of what it's been asked to build.

The distinction matters: the spec tells the agent what to do. The hard boundaries tell the agent what it can never do, regardless of instructions. Both are part of the harness.

Step 5: Capped iteration. Maximum two CI/test rounds. If the code doesn't pass in two attempts, surface the failure to a human with the full context of what was tried. Diminishing returns are real — an agent that fails on the third attempt usually failed on the first attempt in a way that more attempts won't fix.

The cap is not pessimism about the model. It's about diagnosability. When a build fails, there are two explanations: the spec is ambiguous (the agent made a reasonable choice that doesn't match the requirement) or the model made a mistake (the spec was clear and the implementation is wrong). In either case, the human needs to know. A third attempt in the same direction produces a third variation of the same error — not a solution. The human intervention is the diagnostic step, not the giving-up step.
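The capped loop is a few lines of harness code. A sketch under the chapter's two-attempt rule — the attempt function stands in for a full build-plus-CI cycle:

```typescript
// One attempt = implement + run CI; it reports pass/fail and its output log.
type Attempt = () => { pass: boolean; log: string };

// Run at most `cap` attempts; on failure, escalate with every attempt's log.
function cappedIteration(attempt: Attempt, cap = 2):
  { status: "pass"; attempts: number } | { status: "escalate"; logs: string[] } {
  const logs: string[] = [];
  for (let i = 1; i <= cap; i++) {
    const r = attempt();
    if (r.pass) return { status: "pass", attempts: i };
    logs.push(r.log); // keep the full failure context for the human diagnosis
  }
  return { status: "escalate", logs };
}
```

Note what the escalation carries: every failed log, not just the last one. The side-by-side comparison of the two failures is the diagnostic input.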

Step 6: Isolation. Work on branches or worktrees. Never touch main directly. If the agent produces something catastrophic — and it will, eventually — the damage is contained. You delete the branch and start over. Starting over, in the AI era, costs minutes, not weeks.

Claude Code's worktree feature makes this automatic: each agent invocation can run in a temporary git worktree, an isolated copy of the repository. The agent makes changes. If the session succeeds, the worktree is merged. If it fails, the worktree is discarded — and the main codebase is exactly as it was before the session began. No manual cleanup. No git resets. The isolation is structural, not procedural.
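For teams wiring this up by hand, the worktree lifecycle reduces to a fixed command plan. A sketch — the path and branch naming are illustrative, and Claude Code handles the equivalent automatically:

```typescript
// The deterministic command sequence for one isolated session.
// Success merges the branch; failure discards everything, leaving main untouched.
function worktreePlan(session: string, succeeded: boolean): string[] {
  const path = `../wt-${session}`;
  const branch = `agent/${session}`;
  if (succeeded) {
    return [
      `git worktree add ${path} -b ${branch}`, // isolated copy of the repository
      // ...agent session runs inside the worktree...
      "git switch main",
      `git merge ${branch}`,
      `git worktree remove ${path}`,
    ];
  }
  return [
    `git worktree add ${path} -b ${branch}`,
    `git worktree remove --force ${path}`, // discard the session's changes
    `git branch -D ${branch}`,             // main is exactly as it was
  ];
}
```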


Session Continuity: The Handoff Problem

Here is something that matters and rarely gets discussed: language models have no memory between sessions. The agent that spent four hours implementing the authentication module yesterday has no memory of it today. Every session begins cold.

The harness solves this. Not through magic — through the session log.

At the end of every build session, before the context closes, the agent writes a summary to .factory/session-log.md:

## Session 2026-04-03

### What was done
- Implemented `compliance/quorum-checker.ts` — verifies coeficiente sum >= 50% + 1
- Added `compliance_log` table (immutable, insert-only via trigger)
- Quorum check wired into asamblea session opener

### Decisions made
- Used database trigger instead of application logic for compliance_log immutability
  — prevents any code path from bypassing the audit trail
- Quorum checks against TOTAL coeficientes, not just registered attendees
  — per Ley 675, Art. 45: quorum is calculated against total building shares

### What broke (and why)
- First implementation let the application write compliance_log entries directly
  — caught by validator, fixed to trigger-only approach

### Next session
- Wire quorum check into the asamblea wizard UI
- Add the in-progress indicator when quorum is calculating
- Blocked: need decision on whether partial-quorum warning blocks or just alerts

The next session opens by reading this log. The agent instantly has: what was built, why specific decisions were made, what broke and how it was fixed, and exactly what's next. It doesn't reconstruct this from the codebase. It reads it from the log. The session log is working memory that persists across the gap between sessions.

The "what broke" entry is the most valuable line in the log. It doesn't just record failures — it records the reasoning behind the fix. An agent that reads "first implementation let the application write compliance_log entries directly — caught by validator, fixed to trigger-only" understands both the pattern to avoid and why the current approach was chosen. Without that entry, the agent might make the same mistake again in a related module.

This is the harness creating organizational memory. Not the codebase — the artifact layer around the codebase.
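The end-of-session write is mechanical enough to codify. A sketch of a log-entry renderer whose sections mirror the format above — the type and function names are invented for illustration:

```typescript
// One session's summary, matching the session-log.md sections.
type Session = {
  date: string;
  done: string[];
  decisions: string[];
  broke: string[];
  next: string[];
};

// Render the markdown entry the agent appends to .factory/session-log.md.
function renderSessionLog(s: Session): string {
  const section = (title: string, items: string[]) =>
    `### ${title}\n` + items.map((i) => `- ${i}`).join("\n");
  return [
    `## Session ${s.date}`,
    section("What was done", s.done),
    section("Decisions made", s.decisions),
    section("What broke (and why)", s.broke),
    section("Next session", s.next),
  ].join("\n\n");
}
```

The structure is the point: a free-form summary invites the agent to omit the "what broke" section on a good day, and that is precisely the entry the next session needs most.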


When the Build Stalls

The capped iteration principle (two CI rounds, then surface to human) assumes the agent makes reasonable progress. Sometimes it doesn't. Sometimes the agent encounters a task where the spec is ambiguous, the codebase has a gap the spec didn't anticipate, or the implementation involves a pattern the model doesn't handle reliably.

When a build stalls — the agent produces two failing attempts without clear progress — the harness triggers a specific response. Not "try again with a better prompt." A structured diagnosis:

Was the spec clear? Read the relevant spec section. Is the behavior precisely defined, or does it rely on "should" and "typically"? If the spec is ambiguous, the build must stop. Update the spec first. Restart the build with a precise requirement. The agent cannot implement what hasn't been specified.

Is the implementation pattern within the agent's capability? Some patterns — complex state machines, real-time sync, intricate database triggers — are harder for agents to implement reliably on first attempt. If the stall is pattern-related, decompose the task. Break "implement the quorum check" into "write the quorum calculation function" and "wire it into the session opener" as separate sessions. A task that stalls as one unit sometimes completes cleanly as two.

Did the codebase change out from under the task? In multi-agent projects, a parallel agent may have changed a shared file that this agent was depending on. Check git status before diagnosing further. If there's a collision, resolve the conflict with the human present — this isn't an agent decision.

Is this a tier escalation signal? If the agent fails repeatedly on a task that seems straightforward, ask whether the failure pattern reflects hidden complexity. A Tier 2 task that requires multiple architectural decisions may actually be a Tier 3 task that was misclassified. Tier escalation changes the spec depth, the harness touchpoints, and the acceptable scope of agent autonomy.

The capped iteration rule exists because letting agents retry indefinitely produces one of two outcomes: the agent eventually gets lucky and produces a solution that passes CI but has subtle errors, or the agent burns through your entire token budget and produces nothing. Neither is acceptable. The human intervention is not failure — it's the harness working correctly.


The Human Intervention Pattern

When the harness triggers a human intervention — two CI rounds, no clean pass — the intervention has a structure.

The developer doesn't start by looking at the code. They start by looking at the task. Read the spec section the agent was implementing. Read the session log entry for this task. Then look at the two failed outputs side by side.

The comparison usually reveals one of three patterns:

Pattern A — Spec ambiguity. The two outputs differ in a decision the spec didn't make. Both implementations are reasonable interpretations of what was specified. The agent wasn't failing; it was choosing between valid options, differently each time. The fix is in the spec: make the decision, add it as a precise requirement, restart the build with the clarified spec. The agent will implement cleanly on the first attempt.

Pattern B — Pattern complexity. The two outputs are similar but both wrong in the same way — the agent is implementing a pattern it doesn't handle reliably. This isn't an intelligence failure; it's a task decomposition failure. Break the task into smaller units. Often a task that stalls as "implement the quorum validation feature" completes cleanly when decomposed into "write the coeficiente sum function" followed by "wire it into the asamblea opener." Smaller tasks have smaller spec surface areas and smaller failure modes.

Pattern C — Context gap. The implementation requires information the agent doesn't have — a dependency that isn't in the hydration set, a convention that isn't documented in CLAUDE.md, a constraint from a related module that the agent didn't know to check. The fix is to update the harness: add the missing convention to CLAUDE.md, add the dependency to the hydration sequence, add the cross-module constraint to the architecture decisions section. Then restart.

Pattern C is the most valuable of the three, because every Pattern C intervention improves the harness permanently. The constraint that caused this stall was always true — it was just undocumented. Adding it to CLAUDE.md means no future build session on this project will encounter the same gap. The stall was the harness discovering its own incompleteness, and the human intervention was the repair.

This is why the capped iteration rule isn't a concession to model limitations. It's a structured signal-capture mechanism. The agent runs twice. If it doesn't succeed, the human extracts the signal: spec gap, task gap, or harness gap. The harness improves. The next build is faster.
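The three-pattern diagnosis can even be sketched as a decision function. This is a hedged illustration of the triage order, not a claim that the call can be fully automated — the signals still come from a human reading the two failed outputs:

```typescript
// Signals a human extracts from the side-by-side comparison of two failed attempts.
type StallSignal = {
  differentDecisions: boolean; // outputs diverge on a choice the spec didn't make
  sameWrongShape: boolean;     // both outputs wrong in the same way
  missingContext: boolean;     // failure references info not in the hydration set
};

function diagnose(s: StallSignal): "spec-ambiguity" | "decompose-task" | "update-harness" | "unknown" {
  if (s.missingContext) return "update-harness";     // Pattern C: fix CLAUDE.md / hydration
  if (s.differentDecisions) return "spec-ambiguity"; // Pattern A: make the decision in the spec
  if (s.sameWrongShape) return "decompose-task";     // Pattern B: split the task
  return "unknown";
}
```

Pattern C is checked first deliberately: a context gap can masquerade as either of the other two, and fixing the harness is the repair that pays forward.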


Configuring the Agent: The CLAUDE.md Walkthrough

The most important file in any Dark Factory project is .factory/CLAUDE.md. Before the agent writes a single line of code, before the spec is opened, before any tool is called — the agent reads this file. Everything in it is treated as standing instruction. Anything not in it, the agent will infer or invent.

Here is what a complete CLAUDE.md contains, and why each section exists.

Project identity. One paragraph. What is this? Who uses it? What does it not do? This isn't a marketing description — it's an orientation for an agent that has no memory of the previous session. "BuildingManagementOS is a web-based platform for managing propiedad horizontal in Colombia under Ley 675 de 2001. It serves building administrators as the primary user." The agent now knows it is building a compliance-regulated, multi-tenant, Spanish-language system before it reads anything else.

Spec reference. A single line pointing to .factory/spec.md. Not a summary of the spec. A pointer. The spec is the behavioral contract; this is the instruction to read it. "The full specification lives in spec.md. This is a living document — patch it, don't rewrite it." That last clause matters: the agent is explicitly told not to rewrite the spec when it finds ambiguity. Patch it. Preserve history.

Tech stack table. The complete technology inventory — framework, database, auth, ORM, styling, deployment. Not explained, just declared. The agent uses this to make correct import decisions, avoid introducing incompatible libraries, and understand the boundaries of each layer.

| Layer     | Technology          | Purpose                        |
|-----------|---------------------|--------------------------------|
| Framework | Next.js 15 (App)    | Full-stack, deployed to Vercel |
| Database  | Supabase (Postgres) | Multi-tenant, RLS              |
| Auth      | Clerk               | Sessions, RBAC                 |
| ORM       | Drizzle             | Type-safe queries              |

Hard boundaries. Non-negotiable constraints that the agent cannot override regardless of instruction. Every Tier 3-4 project has at least one. They are stated imperatively: "Never publish an acta without explicit human review." Not "you should check for review" — never publish. The imperative form is intentional: the agent reads for instruction, and soft language produces soft compliance.

Common commands. The exact shell commands for building, linting, type-checking, running migrations, starting the dev server. The agent uses these commands when it needs to validate work. Having them wrong means the agent validates against the wrong toolchain or times out on a command that doesn't exist.

pnpm lint        # ESLint, max-warnings 0
pnpm typecheck   # tsc --noEmit, strict mode
pnpm db:generate # Drizzle migration generation
pnpm db:migrate  # Apply migrations to Supabase

Architecture decisions. Not the full spec — just the decisions that are likely to be violated if the agent doesn't know about them. "Multi-tenancy: every data table includes building_id. Supabase RLS policies enforce isolation." An agent building a new table without that constraint could silently break multi-tenant isolation in a way that wouldn't surface until a second tenant registers.

What this does NOT do. The explicit exclusion list. "No payment processing. No AI-generated narrative in legal documents. No offline capability." This prevents scope creep — the agent adding a payment form because the spec mentioned subscriptions, or adding an AI summary because it seemed helpful. Out-of-scope is coded as prohibition.

Here is what those sections look like assembled into a single file for BuildingManagementOS — a Tier 4 compliance system. This is the file the agent reads before every session.

# BuildingManagementOS — Project Configuration

## Project Identity
Web-based platform for managing propiedad horizontal (residential building communities)
in Colombia under Ley 675 de 2001. Primary user: building administrators (administradores).
Secondary users: owners (copropietarios) and the supervisory board (consejo).
Tier 4 — legally-binding governance and compliance. Every output is subject to Colombian law.

## Spec Reference
Full specification: `.factory/spec.md` — living document, patch it, don't rewrite it.
When you find ambiguity, add a clarifying note to spec.md. Do not fill gaps by inference.

## Tech Stack
| Layer     | Technology          | Purpose                              |
|-----------|---------------------|--------------------------------------|
| Framework | Next.js 15 (App)    | Full-stack, deployed to Vercel       |
| Database  | Supabase (Postgres) | Multi-tenant, RLS enforcement        |
| Auth      | Clerk               | Sessions, role-based access control  |
| ORM       | Drizzle             | Type-safe queries, migration history |
| Styling   | Tailwind CSS        | Utility-first, no CSS modules        |
| Language  | TypeScript (strict) | No `any` except documented exception |

## Hard Boundaries (Never Violate)
1. Never open an asamblea session without verified quorum (Ley 675, Art. 45)
2. Never publish an acta without explicit human review and approval
3. Never show financial documents to copropietarios before the full sign-off chain completes
4. Never delete or modify compliance_log entries (immutable audit trail)
5. Never allow an owner to be both physically present AND represented by poder in the same asamblea
6. Never expose stack traces or internal error messages to users
7. Never send automated substantive replies to copropietarios —
   all replies require a human decision-maker

## Common Commands
pnpm dev          # Start development server (localhost:3000)
pnpm lint         # ESLint, --max-warnings 0 (warnings = errors)
pnpm typecheck    # tsc --noEmit, strict mode
pnpm validate     # lint + typecheck — run after every file change
pnpm db:generate  # Generate Drizzle migration from schema changes
pnpm db:migrate   # Apply migrations to local Supabase

## Architecture Decisions
- Multi-tenancy: every data table includes building_id.
  Supabase RLS enforces isolation — never query without a building_id filter.
- Compliance operations: all compliance logic lives in src/lib/compliance/.
  Never inline compliance checks in components or route handlers.
- Data access: all queries go through src/lib/db/queries.ts.
  Never call Supabase client directly from components.
- Server-first: prefer Server Components and server actions.
  Client state for UI ephemera only (modals, loading states).
- Audit trail: compliance_log writes happen via database trigger only.
  Application code must never write to compliance_log directly.

## Out of Scope
- Payment processing (subscriptions handled externally)
- AI-generated narrative in legal documents (all legal text is human-authored)
- Offline capability
- Condominium regimes outside Colombia

The CLAUDE.md is not read once at project setup and forgotten. It is read at the beginning of every agent session. That means it must stay current — when the tech stack changes, update CLAUDE.md. When a new hard boundary is discovered (usually after a near-miss), add it immediately. The CLAUDE.md is the harness's standing orders. Out-of-date standing orders produce out-of-date behavior.


A Real Harness Configuration

The example above is assembled for the chapter. What follows is excerpted from the actual CLAUDE.md running in production on Edifica — the propiedad horizontal platform we ship for Colombian residential buildings under Ley 675 de 2001. Tier 4. Every session in that repo reads this file first.

## Hard Boundaries (Never Violate)

1. Never open an asamblea session without verified quorum
2. Never publish an acta without explicit human review/approval
3. Never show financial documents to owners before the full sign-off
   chain completes (Accounting Firm → Revisor Fiscal [if applicable] → Admin)
4. Never delete or modify compliance log entries (immutable audit trail)
5. Never allow an owner to be both physically present AND represented
   by poder in the same asamblea
6. Never expose stack traces or internal errors to users
7. Never send automated substantive replies on behalf of the Admin via
   any external channel (WhatsApp, Email, SMS) — all replies must come
   from a human decision-maker

## What This System Does NOT Do (MVP)

- No payment processing or accounting functions
- No access control (QR codes, visitor management, cameras)
- No common area reservations
- No AI-generated narrative in legal documents
- No WhatsApp/email integration for system events
- No offline capability (standard PWA, assumes connectivity)
- No tenant (non-owner) or investor-owner role variants

## Common Commands

pnpm dev          # Run dev server (http://localhost:3000)
pnpm lint         # Run linter
pnpm typecheck    # Type-check without emitting
pnpm test         # Run tests
pnpm db:generate  # Generate Drizzle migration after schema change
pnpm db:migrate   # Apply migrations to Supabase

I'll admit — the first version of this file was half this length, and we paid for it. Every missing rule was a near-miss we had to clean up, and every near-miss got added back as a line. What's in the file now is scar tissue turned into standing orders.

Three things to notice. The hard boundaries are the permanent floor. The agent cannot cross them regardless of what the spec says, what a user asks for, or what seems "helpful" in the moment. They are not preferences — they are the non-negotiable layer that protects the administrador from legal exposure under Colombian law. This is what a Tier 4 moat looks like at the config level.

The "does NOT do" list is the scope fence. Without it, the agent infers. It sees "financial documents" in the spec and offers to add a payment form. It sees "legal documents" and offers to generate narrative. Every inference is a bottleneck waiting to happen. The explicit prohibition is cheaper than the cleanup.

The common commands are the deterministic rail. Same commands, every session, every developer, every agent. pnpm lint, pnpm typecheck, pnpm test — invoked the same way whether the agent is adding a field to a schema or wiring a new inbox channel. No flexibility here is a feature. This is the niche where reliability comes from boring repetition, not cleverness.


The Skills Layer

The build sequence describes the infrastructure around the creative work. The skills layer is the creative work itself.

In the Dark Factory methodology, skills are pre-built, testable units of agent capability — each skill encodes a discrete task, a prompt optimized for that task, the tools it requires, and the expected output format. Where the harness governs how the agent operates, skills govern what the agent does within each operation.

The relationship is simple: the spec defines what needs to be built. The skills define how to build it. The harness defines the conditions under which building happens.

A spec for a marketing automation platform might define: "The system must produce a competitive analysis given a brand name and market segment." That's a behavioral requirement. The competitor-analysis skill is what implements it — a structured prompt that calls a search tool, extracts specific data points, and produces a formatted output. The skill can be developed, tested, and improved independently of the platform. When the platform builds the competitive analysis feature, it instantiates the skill.

This matters for build quality because testable skills produce testable features. A skill that has been validated in isolation — checked against known inputs, stress-tested with the variations in the eval library, confirmed to produce consistent output — is dramatically easier to wire into a platform than a prompt written inline during the build session. The build session becomes assembly, not improvisation.

The VZYN Labs pivot illustrates this in reverse. The original thirteen-agent architecture had no skill catalog — logic was embedded in agents, each one a bespoke implementation of a marketing function. When the pivot simplified to a single agent, the rebuild didn't just change the architecture. It extracted fifty-seven discrete skills from the old implementation and made them individually testable. "It is easier to specify fifty-seven skills than to specify thirteen orchestrated agents." The spec became manageable, the build became repeatable, and the testing became possible.

For a new project, the skill catalog is built during SPEC, before BUILD begins. The BUILD phase assembles the skill catalog into the product. For each skill, the build question is the same: does the skill fire when it should, with the right tools, and produce the expected output format? This is a narrow, verifiable test — much easier to answer than "does the feature work?" which is the question you're forced to ask when logic is embedded rather than extracted.
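The narrow build question — right trigger, right tools, right output format — implies a skill shape. A sketch of what such a unit might look like; the interface and the competitor-analysis example are illustrative, not the VZYN Labs implementation:

```typescript
// A skill: a discrete task with its prompt, required tools, and output validator.
type Skill<I, O> = {
  name: string;
  tools: string[];                     // tools the skill requires to fire
  prompt: (input: I) => string;        // prompt optimized for this one task
  validate: (raw: string) => O | null; // null = malformed output, skill test fails
};

// Example: the competitor-analysis skill from the spec requirement above.
const competitorAnalysis: Skill<{ brand: string; segment: string }, { competitors: string[] }> = {
  name: "competitor-analysis",
  tools: ["web-search"],
  prompt: ({ brand, segment }) =>
    `Analyze competitors of ${brand} in the ${segment} segment. Return JSON: {"competitors": [...]}`,
  validate: (raw) => {
    try {
      const parsed = JSON.parse(raw);
      return Array.isArray(parsed.competitors) ? { competitors: parsed.competitors } : null;
    } catch {
      return null; // non-JSON output is a failed skill test, not a platform bug
    }
  },
};
```

The validator is what makes the skill testable in isolation: feed it known model outputs, confirm the format holds, and the build-phase question collapses to "does the skill fire and validate?"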


Greenfield vs. Brownfield Builds

The build sequence is the same for greenfield and brownfield projects. The pre-hydration is not.

Greenfield builds start from a spec and an empty codebase. Pre-hydration is: CLAUDE.md, the spec section for the current feature, and the data model (schema files). The agent has no prior codebase to navigate and no architectural patterns to inherit. This is the simplest build environment — the agent makes all the design decisions within the constraints of the spec and tech stack.

Brownfield builds start from a spec and an existing codebase. Pre-hydration is: CLAUDE.md, the spec section, the data model, and the discovery document.

The discovery document is the artifact that makes brownfield builds tractable. It contains:

## System Map
- Entry points: `src/app/api/` (route handlers), `src/app/(dashboard)/` (UI routes)
- Data access: all queries go through `src/lib/db/queries.ts` — never direct Supabase calls from components
- State management: server-side only — no client state except UI ephemera

## Conventions
- Server actions in `src/app/actions/` — named `[domain]-actions.ts`
- Zod schemas in `src/lib/validators/` — shared client/server
- All compliance rules in `src/lib/compliance/` — never inline

## Change Impact Map
- Schema changes require: db:generate → db:migrate → update affected queries → update Zod validators
- New compliance rule requires: add to compliance/ → add to compliance_log trigger → add hard boundary to CLAUDE.md
- New user-facing feature requires: route handler → server action → UI component (in that order)

The discovery document is the codebase map. It tells the agent where things live, what the conventions are, and what changes require cross-file coordination. Without it, the agent navigates by inference — reading file after file to build a mental model of the system. With it, the agent reads the map, goes directly to the right location, and makes changes within the established patterns.

Brownfield builds without a discovery document are where "crazy things" live. The agent can't see the full system; it sees what's in its context window. It makes changes that are locally correct but globally inconsistent — using a different naming convention, making a direct database call that bypasses the query layer, adding state where the architecture uses server-side rendering. None of these are intelligence failures. They are navigation failures. The discovery document is the navigation infrastructure.


The Architecture Constraint

The harness can only do so much. If the underlying architecture is too complex for an agent to navigate productively, no amount of CLAUDE.md configuration will fix it.

This is the lesson from VZYN Labs. The engineers chose hexagonal architecture — a pattern designed for enterprise-grade systems with complex domain logic. For an unvalidated MVP with a Tier 2 risk profile. The codebase was structurally correct: clean, layered, respecting separation of concerns at every level. And completely unusable with AI coding tools.

"So many layers of complexity before it could actually be productive." The agent couldn't find the right entry point for a change without traversing multiple abstraction layers. It couldn't add a feature without touching adapters, ports, and domain objects that all needed to stay in sync. It couldn't read the codebase holistically because the relevant logic was distributed across files that didn't reference each other directly.

The test I use before committing to an architecture: can an agent, given only the file structure, answer the question "where does X happen?" If yes, the architecture is navigable. If the answer requires understanding the abstractions first, it's too complex for agent-assisted development at this stage.
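The test can be approximated mechanically. The sketch below, with illustrative file paths and a deliberately crude keyword match, shows the difference: a flat structure answers "where does checkout happen?" from file names alone, while a ports-and-adapters structure hides the answer behind abstraction names.

```typescript
// A crude mechanical proxy for the navigability test: given only file
// paths, does a keyword point at the right file? Paths are illustrative.
function whereDoes(files: string[], topic: string): string[] {
  const needle = topic.toLowerCase();
  return files.filter((f) => f.toLowerCase().includes(needle));
}

// A flat structure answers from names alone.
const flat = [
  "src/app/actions/checkout-actions.ts",
  "src/lib/db/queries.ts",
];
console.log(whereDoes(flat, "checkout")); // finds checkout-actions.ts

// A ports-and-adapters structure does not: the checkout logic exists,
// but no path names it.
const layered = [
  "src/domain/ports/OrderPort.ts",
  "src/adapters/persistence/OrderRepositoryAdapter.ts",
];
console.log(whereDoes(layered, "checkout")); // empty: nothing matches
```

An agent is far more sophisticated than a keyword match, of course. But the same gradient applies: every inference step between "the task" and "the file" is a step where the agent can go wrong.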

Flat architectures pass this test easily. The Ecomm project chose Postgres with pgvector — a flat, predictable data layer with no abstraction overhead. Business logic lives in server actions. There are no service layers, no ports-and-adapters, no domain object hierarchies. An agent given the schema and a task can usually find the right file on the first try.

This doesn't mean you should always build flat systems. Large teams, long-lived codebases, and systems with genuinely complex domain logic benefit from structured architecture. But the decision must be made consciously, with knowledge of what it costs in agent navigability. A Tier 2 MVP should be as flat as possible. A Tier 4 production system may warrant more structure — but the harness must compensate with a richer discovery document that maps the terrain.

The architecture constraint is the one dimension the harness cannot fix after the fact. You can update CLAUDE.md. You can add validation commands. You can tighten the hard boundaries. But if the architecture is fundamentally too layered for agent navigation, the only fix is rebuilding with a flatter structure — which is what VZYN had to do, two months and many engineer-hours after the wrong choice was made.


Why the Harness Matters More Than the Model

The VZYN Labs failure wasn't a model failure. The models could generate code. The failure was architectural: the hexagonal structure was so layered that even a capable agent couldn't be productive inside it.

The Ecomm project made the opposite choice. Its flat, predictable Postgres-and-pgvector data layer is one any agent can navigate: the structure is simple enough that the agent's creative work stays focused on business logic rather than on layers of abstraction.

The harness creates the environment where the agent can do its best work. A good harness is invisible — the agent doesn't fight it, doesn't work around it, doesn't even notice it. A bad harness — or no harness — means the agent makes decisions about things it shouldn't decide, produces inconsistent operational behavior, and occasionally produces "crazy things" that require a human to catch.

Hernan will get there. "I think as I polish my prompts and see the results coming out the way I want," he told me, "then I'll say 'do it' and go have a coffee."

He wasn't wrong about the prompts. Prompt quality matters — a well-structured instruction produces better first drafts than a vague one. But the roller coaster he described — the sessions that produced exactly what he needed, the sessions that produced "crazy things" — wasn't caused by prompt variance. It was caused by harness absence.

When the sessions succeeded, they succeeded because the task was self-contained, the context was naturally bounded, and there were no operational decisions for the agent to make. When the sessions produced crazy things, it was because the task touched something the agent had to infer — the project conventions, the naming patterns, the extent of what it was authorized to change. Without a CLAUDE.md, the agent answers those questions itself. Sometimes correctly. Sometimes not.

The coffee Hernan wants is a build session that starts with full context, validates at every step, operates within hard boundaries it can't cross, and ships a predictable result. Not a prompt — a harness. The harness is what makes the coffee possible, and what makes the coffee consistent.

The next layer of the stack is what happens after the agent ships. Testing isn't a checkpoint here; it's how we find out whether the harness is actually holding.


The next chapter covers what happens after the build is complete — the four-layer testing methodology that validates behavior under adversarial conditions.