Chapter 15

Getting to Work

What do you do with this?


What You Now Know

You know that the bottleneck moved. Implementation is no longer the constraint — a model can generate a working feature from a good spec in minutes. The constraint is specification: the quality of what goes into the machine. Every ambiguity in the spec is a decision the agent makes without telling you, and those decisions compound.

You know why they compound. A ten-step pipeline at 90% per step produces a 35% end-to-end success rate. Not a bug — math. The fix requires three layers: skills that get each step to 90%, a harness that pushes the system to 99%, and evaluation that catches the rest. Most teams have the first layer. Almost nobody has all three.
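The arithmetic is easy to verify. A quick sketch — the step count and per-step rates are the chapter's numbers; the code is purely illustrative:

```python
# End-to-end success of a sequential pipeline: every step must succeed,
# so per-step reliability multiplies across all steps.
def pipeline_success(per_step: float, steps: int = 10) -> float:
    return per_step ** steps

print(f"{pipeline_success(0.90):.0%}")  # ten steps at 90% each → 35%
print(f"{pipeline_success(0.99):.0%}")  # ten steps at 99% each → 90%
```

The same formula shows why the harness layer matters: moving each step from 90% to 99% lifts the whole pipeline from roughly a third working to nine in ten working.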

You know why the metrics you use to measure progress are lying to you. DORA measures pipeline speed. AI makes pipelines faster. DORA turns green. The code gets worse. The measurement framework was built for a world where humans write code and machines run it. That world is ending. The replacement — AOME — measures fleet output, orchestration quality, capability horizon, escalation health, and context integrity. Nobody has automated scorecards for this yet. That's a problem to solve.

You know that the resistance to AI in enterprises — the 27% of organizations that have banned generative AI tools, the 30% of U.S. banks that prohibit it entirely — is not irrational. It's the correct risk posture for organizations that haven't built governance infrastructure for systems they can't control. The discovery document, the intent contract, the trust tier, the mandatory oversight at Tier 4 — these are the governance infrastructure. The organizations that build it will deploy. The ones that don't will either ban or suffer.

You know how reliability is built. Specification → Harness → Evaluation. In that order. Each layer depends on the ones before it. There are no shortcuts.


Three Roles

Every AI agent project requires three human roles. They can be played by one person, two people, or three, depending on the project's scale. They cannot be compressed to zero.

The Spec Architect

The person who writes the specification. Not necessarily an engineer. The most important qualification is the ability to make decisions explicitly — to look at an ambiguous requirement and resolve it precisely enough that an agent can implement it without asking a clarifying question.

The spec architect needs two things: domain expertise (knowing what the system should do in the domain) and specification skill (encoding that knowledge in a format that leaves no important decisions unmade). These don't always live in the same person. A pharmacist knows what a medication guidance system should do. A specification-skilled builder knows how to encode that knowledge as a behavioral contract. The partnership between these two produces the most reliable specs.

I didn't know how to write code — but I understood how to write a brief. Fifteen years in marketing, a business degree, and no engineering training turned out to be the right preparation for spec architecture. Not because marketing resembles software engineering, but precisely because marketing is fundamentally a specification discipline. Every campaign brief defines the audience (users), the message (behavior), the success metric, and the constraints. That structure translates directly. What changed was the precision required — specs for agents must be more explicit, more complete, and more honest about what isn't known than most marketing briefs.

The spec architect's job is never done. Specifications drift. The system changes, the domain changes, the organization's priorities change. The spec architect maintains the living document that keeps the agent's behavioral contract accurate over time.

The Agent

The AI system that implements the specification. In practice, this means Claude Code, or Codex, or whatever execution environment is available — with the harness configuration that turns a general-purpose tool into a project-specific tool.

The agent's job is creative within the bounded space the spec and harness create. It chooses patterns, writes code, implements behavior. The harness limits what it can do. The spec directs what it should do. The evaluation architecture validates whether it did it.

The agent is not a person. But treating it as a role clarifies something important: the agent is the middle of the pipeline, not the whole thing. The organizational instinct is to focus on the agent — to pick the best model, craft the best prompts, optimize the middle layer. Pike's Rule 5 says otherwise: data structures, not algorithms. The spec and intent contracts are the data structures. Get those right, and the agent's behavior follows.

The Validator

The person who audits the output. Critically: a different person from the spec architect.

The same person who wrote the specification is the worst person to verify whether it was implemented correctly. They see what they intended, not what was built. They skip the edge cases they decided not to handle because they remember deciding not to handle them. They read the code charitably, filling gaps with their knowledge of the system.

The validator reads the code naively — like someone who will maintain it, not like someone who wrote it. They ask: does this do what the spec says? Does it do anything the spec doesn't say? Are there behaviors the spec explicitly prohibits that the implementation might accidentally allow?

Engineers are natural validators. Technical depth, professional skepticism, the habit of asking "what happens if?" — these are validation skills. The career insight for engineers in the AI era: validation is where the judgment is hardest to automate, where domain knowledge matters most, and where errors have the highest cost. Being the person who catches what the agent missed is a more defensible moat than being the person who writes what the agent could have written.

Samir — the developer who fixed 7,000 bugs in a spec-driven system — described what validators do better than any framework: "You make a change, and you go back to the spec and ask: is what I did aligned with what's in there?" Without the spec, there's nothing to compare against. With the spec, the validator has a document the agent can be held to — and caught lying against.


The Shape of a First Project

Described in full, the methodology is intimidating: eight phases, four trust tiers, twelve harness principles, four evaluation layers, two-layer intent engineering. Don't apply all of it to your first project. Apply the right amount for your first project's tier.

If your first project is Tier 1 or 2 (worst realistic outcome: wasted time or money):

Start here:

  1. Write a project card. Three paragraphs: what you're building, who uses it, what failure looks like.
  2. Set a scope boundary: state what is explicitly out of scope.
  3. Write a minimal spec. Use the eight-section format, but keep each section brief. The spec should be completable in three to four hours.
  4. Create a .factory/ directory. Write a CLAUDE.md with project identity, tech stack, and any hard boundaries you can identify.
  5. Build with validation. Lint and type-check after every file. Track what breaks and why in the session log.
  6. Review the output. Does it do what the spec says?

That's it. Full methodology for Tier 1. You'll miss some of it. You'll do some of it imperfectly. The spec will have gaps. The harness will be thin. But you will have done the one thing that matters most: you will have written down what you're building before you built it. The difference that makes — in clarity, in agent reliability, in the speed of debugging when something goes wrong — is the fastest way to understand why the rest of the methodology exists.
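Step 4 of the list above can be a one-minute script. A sketch — the .factory/ directory and CLAUDE.md come from the chapter, but putting CLAUDE.md at the project root rather than inside .factory/ is this sketch's choice, and the section headings are illustrative, not a prescribed format:

```python
import tempfile
from pathlib import Path

# Scaffold the minimal harness: a .factory/ directory plus a CLAUDE.md
# skeleton. Headings below are an illustrative sketch, not a canon.
def scaffold_harness(root: str) -> Path:
    project = Path(root)
    (project / ".factory").mkdir(parents=True, exist_ok=True)
    claude_md = project / "CLAUDE.md"
    if not claude_md.exists():
        claude_md.write_text(
            "# Project Identity\n"
            "(what this is, who uses it, what failure looks like)\n\n"
            "# Tech Stack\n"
            "(languages, frameworks, versions)\n\n"
            "# Hard Boundaries\n"
            "(things the agent must never do)\n"
        )
    return claude_md

demo = scaffold_harness(tempfile.mkdtemp())
print(demo.read_text().splitlines()[0])  # → # Project Identity
```

The skeleton is deliberately thin: the point at Tier 1 is that the file exists and gets filled in, not that it's complete on day one.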

If your first project is Tier 3 (worst realistic outcome: financial or legal damage):

Add:

  • A full spec, all eight sections at depth. Plan four to eight hours.
  • An intent contract with the cascade of specificity for the primary goal.
  • Seven behavioral scenarios with three stress variations each.
  • Human review before any consequential output reaches a user.
  • A defined validator role — a different person from the spec architect.

If your first project is Tier 4 (worst realistic outcome: safety, legal, or irreversible harm):

Don't start a Tier 4 project with this methodology as your first experience of it. Run a Tier 2 project first. Get comfortable with the spec format, the session log pattern, the harness basics, the behavioral scenarios. Tier 4 requires the same discipline at five times the overhead. The discipline has to be internalized before the overhead can be managed.

If you're already in a Tier 4 project and the methodology is new, start where you are. Write the spec for the next feature, not the whole system. Build the harness around the next build session, not the whole codebase. Add the evaluation scenarios incrementally. Imperfect methodology applied to a live project is better than perfect methodology applied too late.


What Not to Do First

The "getting started" advice has a mirror: the things that look like starting but aren't.

Don't start by picking the model. The choice between GPT-5, Claude, or Gemini is the last decision that matters, not the first. The model operates inside the harness. A well-specified system on a capable model ships better output than a poorly specified system on a frontier model. The model is the engine; the spec is the destination; the harness is the car. Spending your first weeks on model comparison is optimizing the engine while leaving the destination undefined and the car unbuilt.

Don't start by building the most sophisticated architecture you can imagine. Pike's Rule 3: fancy algorithms are slow when n is small, and n is usually small. Multi-agent orchestration, self-improving evaluation loops, dynamic capability routing — these are advanced patterns that make sense when you have a validated use case, a working simple system, and evidence that complexity is required. They don't make sense as a starting architecture for an unvalidated concept. Start with one agent. Add agents when one agent proves insufficient.

Don't start with the hardest problem. The methodology is a discipline. Disciplines require practice before they're reliable under pressure. A Tier 4 system with safety consequences is not the place to practice. A Tier 1 or Tier 2 project — low stakes, reversible failures, quick feedback loops — is where you build the muscle memory. The spec format, the session log habit, the behavioral scenario structure, the harness basics: learn them on a project where getting them wrong doesn't matter, before applying them to a project where it does.

Don't mistake prompt quality for spec quality. A well-crafted system prompt is not a behavioral contract. Prompts are the invocation. Specs are the substance. The test is whether someone who doesn't know the system can read the spec and determine whether a given agent output is correct. If the spec is just a system prompt, the answer is no — because the prompt describes how to talk to the agent, not what the agent is supposed to do.

Don't skip the session log. The first thing that gets dropped when the work feels urgent is the session log. The next session starts without it. The agent has no context. The build stalls while the agent reconstructs what was done in the previous session from the codebase. The session log takes five minutes to write; not writing it costs twenty minutes of context reconstruction at the start of the next session.
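Five minutes is all a useful entry takes. A minimal template — the field names are a sketch, not the book's canonical format:

```python
from datetime import date

# Minimal session-log entry: just enough for the next session to start
# with context instead of reconstructing it from the codebase.
# Field names are illustrative, not a prescribed schema.
SESSION_LOG_TEMPLATE = """\
## Session {day}
**Done:** {done}
**Broke / why:** {broke}
**Next:** {next_up}
**Open decisions:** {open_decisions}
"""

entry = SESSION_LOG_TEMPLATE.format(
    day=date.today().isoformat(),
    done="implemented quota check on upload endpoint",
    broke="type-check failed on nullable user id; fixed with a guard",
    next_up="wire quota errors into the UI",
    open_decisions="hard limit vs soft warning at 90% quota",
)
print(entry)
```

The "Open decisions" line is the one that earns its keep: it's the list of ambiguities the agent would otherwise resolve silently next session.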

These aren't exotic failure modes. They're the patterns that appear in almost every team's first AI project. The methodology exists to prevent them systematically. But the prevention only works if you run the methodology, which means starting where it starts: intake, then spec, then harness, then build.


The Skills That Will Compound

Most skills in software development are about doing — writing code, debugging systems, shipping features. The skills the methodology requires are mostly about knowing and deciding.

Specification skill: The ability to look at a problem and describe what a correct solution does, precisely enough that an agent can implement it without asking questions. This is the skill that compounds hardest in the AI era. It's also the most transferable — the same precision that makes agent specs effective also makes business requirements clearer, design briefs better, and engineering conversations faster.

Domain literacy: Understanding a domain well enough to know what matters and what doesn't. Not expertise — you don't need to be a pharmacist to write the Ecomm spec. But enough familiarity to ask the right questions, recognize when an answer is incomplete, and know which edge cases are safety-critical and which are trivial. Domain literacy compounds over time: each project in a domain builds pattern recognition that makes the next spec in that domain better.

Evaluation judgment: The ability to look at an agent's output and tell whether it's correct — not just whether it looks right. This requires understanding the difference between "the agent did what I said" and "the agent did what I meant." It requires the patience to trace outputs back to the behavioral contracts that were supposed to produce them. It's a skill that gets faster with practice and slower without it.

Intent engineering: The ability to make organizational trade-offs explicit. Not just documenting them — making them machine-actionable. Most organizations have implicit trade-offs that govern every significant decision. Encoding them as intent contracts makes those trade-offs deliberate, auditable, and maintainable.

These are not prompt engineering skills. They're not model-specific. They're not tied to any particular tool or platform. They're the skills of building reliable systems from specifications — which was valuable before AI and is more valuable now.


How to Introduce This in an Organization

The hardest part of applying this methodology inside an existing organization isn't the technical implementation. It's the organizational negotiation.

Most organizations have existing AI initiatives, existing approval processes, and existing definitions of what "using AI well" looks like. The methodology in this book is likely more rigorous than those definitions. Introducing it requires navigating the gap between what the organization is comfortable with and what reliable AI systems actually require.

The approach that works is not "our methodology is better than what you're doing." It's "let me show you what this looks like on one project."

Pick the right project. Not the highest-stakes system — that's not where you want to learn on the job. Not the lowest-stakes system — that won't demonstrate enough to be compelling. Pick a Tier 2 or Tier 3 project where:

  • Someone in the organization genuinely wants a better outcome
  • You have access to the domain expert who can help write the spec
  • The consequences of a careful process are visible (a system that works reliably) and the consequences of a careless one are also visible (a system that doesn't)

Run the project with full methodology. Write the spec. Build the harness. Run the scenarios. Produce the certification report. Show the artifact chain: here's what the system was supposed to do (spec), here's how we verified it does it (test results), here's how we know when it stops doing it correctly (evaluation flywheel).

That artifact chain is the argument. Most organizations haven't seen AI projects produce it. The projects they've run produced prompts, demos, and deployed models that nobody could verify or maintain. Showing the alternative — a documented, verifiable, maintainable system — converts more people than any description of the methodology would.

The organizations that adopt this approach will ship better AI systems faster than the ones that don't. That's the moat. But the adoption starts with one project, done right, that demonstrates what "done right" looks like.


What to Do on Monday

If you're starting from zero and you want to apply what this book describes, here's the sequence.

This week: Find a problem you understand well enough to specify. Not the most important problem — a problem where you have genuine domain knowledge. Write a project card for it. Three questions: what are you building, how dangerous is failure, is it new or changing something that exists? Set the tier. State the worst realistic outcome explicitly.

Next week: Write the spec. Use the eight-section format from Chapter 7. Spend real time on the behavioral scenarios — seven situations the system will encounter and what the correct behavior is in each. If you can't write the scenarios, you don't understand the problem well enough to spec it yet. That's valuable information: go learn more about the domain before specifying.

The week after: Build a minimal harness. Create the .factory/ directory. Write the CLAUDE.md with project identity, tech stack, and any hard boundaries you know. Write the session log format. Set up linting that fails on warnings, not just errors. Run one build session. Write the session log at the end. Start the next session by reading it.

The month after: Run the behavioral scenarios against what was built. Find the ones that fail. Diagnose: spec gap, model gap, or harness gap? Fix the root cause. Run again.
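The run-diagnose-fix loop can be sketched as a tiny scenario runner. The scenario structure below is an assumption for illustration, not the book's schema — and the diagnosis step (spec gap, model gap, or harness gap) stays human:

```python
from dataclasses import dataclass
from typing import Callable

# A behavioral scenario pairs an input (here, a thunk that produces the
# agent's output) with a check of that output against the spec.
@dataclass
class Scenario:
    name: str
    run: Callable[[], str]          # produces the agent's output
    check: Callable[[str], bool]    # does the output meet the spec?

def failing_scenarios(scenarios: list[Scenario]) -> list[str]:
    # Collect names of scenarios whose output fails the spec check.
    # Each failure then gets diagnosed by hand: spec, model, or harness gap.
    return [s.name for s in scenarios if not s.check(s.run())]

# Toy stand-ins for agent calls; a real runner would invoke the system.
scenarios = [
    Scenario("refund over limit", lambda: "escalate", lambda o: o == "escalate"),
    Scenario("refund under limit", lambda: "approve", lambda o: o == "approve"),
]
print(failing_scenarios(scenarios))  # → [] — every scenario passed
```

The value isn't the ten lines of code; it's that the scenario list is a durable artifact. Each diagnosed failure adds a scenario, and the library only grows.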

This is the full cycle. Intake → spec → harness → build → test → certify → deploy → maintain → back to spec. The first cycle will be slow and imperfect. The second will be faster. The tenth will be the methodology working at the speed it's designed to work at.

You will not get it right the first time. That's not the goal. The goal is to build the habits — write before you build, enforce rather than prompt, measure before you optimize — and let those habits improve with each project.


The People This Is For

This book was written for the people in the middle of the transition: not the early adopters who were already building before the frameworks existed, and not the late adopters who will pick this up after the methodology is established practice. The people in the middle — who understand that something has shifted, who have seen both the potential and the chaos, who need a systematic approach rather than another set of tips.

Hernan, who wondered whether a good spec would make him unnecessary: the methodology gave him his answer. His job didn't disappear. It changed. He became the person who understands the spec well enough to break it into tasks, who catches the agent when its implementation drifts from intent, who validates with the reference that makes catching possible. The spec made him more valuable, not less.

Joen, who learns in an environment where AI is assumed: the methodology gives him a framework for the skills he's building intuitively. Knowing when a problem is Tier 1 versus Tier 4. Understanding why intent contracts matter. Building the habit of specifying before building. These are the skills that distinguish someone who uses AI effectively from someone who uses it casually.

Francisco, who felt threatened: the methodology makes explicit what was always true. His domain expertise — forty years of accumulated judgment about what matters in his industry — is the most valuable input into any AI system that operates in his domain. The methodology doesn't replace that judgment. It creates the infrastructure for it to be encoded, preserved, and acted upon at scale.

Samir and Carlos, who adapted: the methodology gave them the map they'd been missing. Every bug traceable to a spec requirement. Every change verifiable against a behavioral contract. The agent honest because the spec gave it something to be held to. Their work didn't get easier. It got cleaner.

The methodology is for everyone who's ready to take the transition seriously — who understands that the bottleneck moved and wants to build the discipline to work at it.


Where This Goes

The book ends here. The projects don't.

Regasificadora is in Phase 1. The operational manuals for Ecopetrol aren't finished. The executive decision platform is on the roadmap for Phase 2. By the time you read this, it may have succeeded, failed, pivoted, or been rebuilt from a better understanding of what it should have been.

Ecomm hasn't deployed. The discovery phase revealed enough about the existing wiki that the specification required rethinking twice. The QA team is involved in the scenario library. The first production test will probably break something the spec didn't anticipate. That's fine. That's what the evaluation flywheel is for.

Edifica has its first clients — small building administrations in Medellín. The governance module is live in supervised mode. No assemblies have been held through the system yet. The first assembly will reveal whether the spec's quorum logic actually matches Colombian practice under Ley 675, or whether there's an edge case the law creates that the spec didn't.

VZYN is running the pre-audit playbook on Stark. The first report will either open a conversation or it won't. If it doesn't, the diagnosis starts with the report quality and works backward through the spec.

Every one of these projects is an application of the methodology. Every one of them will produce findings that the methodology doesn't yet account for. Those findings will improve the methodology — the evaluation library will grow, the spec templates will get better, the harness configurations will get tighter.

This is how methodologies evolve: not through theory, but through practice that reveals where theory was incomplete.


The Long Game

One project won't make you proficient at this. The methodology compounds with practice.

The first project produces a spec that has gaps. Some of those gaps produce build stalls that you diagnose and fix. Others produce evaluation failures that you discover in testing. A few make it to the certification review, where the certifier (or you, reading your own work with fresh eyes) catches them. A small number make it to deployment, where they produce the kind of behavior that real users are excellent at eliciting.

Every gap you catch and fix produces two things: a better system, and a better understanding of how to write the next spec. The behavioral scenario you added after a build stall is now in your pattern library. The hard boundary you encoded after a near-miss is now part of your CLAUDE.md template. The intent contract clause you added after the Klarna-style drift becomes the first clause you write in every intent contract for similar systems.

The second project is faster and better than the first. Not dramatically — you'll make new mistakes — but systematically better on the specific failure modes you encountered before. The third is faster than the second. By the fifth, the methodology isn't something you're following. It's something you're thinking with.

This is the compounding that matters. Not the technical compounding of models improving or capabilities expanding — that happens whether or not you invest in methodology. The compounding of your specification skill, your evaluation judgment, your harness instincts. Those are yours. They don't depreciate when the next model releases. They appreciate with every project that tests and refines them.

The organizations that will win the next decade of AI development aren't the ones with the earliest access to frontier models. Access is commoditizing. The ones that will win are the organizations that build the institutional capacity to specify reliably, enforce deterministically, and evaluate continuously — at scale, across projects, with a team that knows how to work at the new bottleneck.

Build that capacity. One project at a time. Start the next one before you finish reading this page.


The Equation, One Last Time

The book's central argument is one equation:

Spec quality × harness enforcement × continuous evaluation = reliable software

The multiplication is not metaphorical. Any factor at zero produces zero. A perfect spec with no harness produces unreliable behavior. A perfect harness with no evaluation produces undetected drift. A comprehensive evaluation library with no spec produces evaluation without a standard.
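Stated as code, the equation is one line — the 0-to-1 scale for each factor is an assumption for illustration:

```python
# The book's equation, literally: reliability is a product, not a sum.
# Each factor on a 0..1 scale; any factor at zero zeroes the whole.
def reliability(spec: float, harness: float, evaluation: float) -> float:
    return spec * harness * evaluation

print(reliability(1.0, 1.0, 0.0))  # perfect spec and harness, no evaluation → 0.0
print(reliability(0.9, 0.9, 0.9))  # three good-but-imperfect factors
```

If the factors summed, a perfect spec could paper over a missing harness. Because they multiply, it can't — which is the whole argument for investing in all three.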

All three factors must be nonzero. All three improve with investment. All three compound over time — a better spec produces fewer harness interventions; a tighter harness means evaluation catches edge cases rather than basic failures; comprehensive evaluation reveals spec gaps that make the next spec better.

The bottleneck moved. The discipline to work at the new bottleneck exists. The question is whether you'll build the habits — specification first, harness always, evaluation continuously — before the cost of not having them becomes obvious.

Francisco said it best, the first time he understood what I was describing: "The thing I know best is now the hardest thing to express."

He was right about the hard part. The domain knowledge, the judgment, the years of accumulated expertise — these are now the most valuable inputs into the machine. But expressing them precisely enough for a machine to act on them is new work. It requires new habits. It requires writing things down that used to live only in someone's head.

That's the transition. Not that human expertise became less valuable — the opposite. It became the bottleneck.

The methodology in this book is how to work at the bottleneck — deliberately, repeatably, in a way that gets better with every project.

Start with one project. Write the spec. Build the harness. Run the scenarios. Let the evaluation tell you where you're wrong.

Then ship the next one.