Part: The Shift

Chapter 2: The March of Nines

Here is a number that should change how you think about AI agents: 65%.

A ten-step agentic workflow where each step succeeds 90% of the time produces an end-to-end success rate of only about 35% — a 65% failure rate. That means roughly two in three runs fail. At ten runs per day, that's six or seven failures daily. Not catastrophic failures — not the kind that crash the system. The quiet kind. A step that skips validation. A classification that's slightly off. A recommendation that's reasonable but wrong. Each one plausible enough to pass casual inspection. Each one compounding into a system that works most of the time and fails unpredictably.

Now push each step to 99% reliability. The same ten-step pipeline produces approximately 90% end-to-end success — about one failure per day. Better. But in a Tier 4 system handling prescription medication referrals, one failure per day is still one patient at risk per day.

Push to 99.9% per step: one failure every ten days. Push to 99.99%: one failure every hundred days. This is the March of Nines — the relentless, compounding requirement for reliability that separates demo-grade systems from production-grade ones.
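The arithmetic behind those figures is a one-liner; here it is in Python (an illustrative calculation, not code from any system described in this book):

```python
# End-to-end success of an n-step pipeline where every step must succeed
# is the per-step reliability raised to the number of steps.

def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed."""
    return per_step ** steps

for per_step in (0.90, 0.99, 0.999, 0.9999):
    rate = end_to_end_success(per_step, steps=10)
    print(f"{per_step:.2%} per step -> {rate:.1%} end to end")
```

Running it reproduces the chapter's numbers: ten steps at 90% is a 35% system; at 99%, a 90% system; each added nine buys roughly a tenfold drop in failures.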

Most teams ship the first layer: prompts that get each step to roughly 90%. Almost nobody builds the second and third.


The Naive View

When Yuval Noah Harari wrote about AI systems in Nexus,[1] he described what he called the Naive View: the assumption that connecting capable components produces a capable system. Add a capable language model to a capable data pipeline connected to a capable output system, and you get a capable application.

The Naive View is wrong, and Claude Shannon proved it in 1948.[2]

Shannon's information theory established the fundamental problem of communication: every channel introduces noise. Every transmission degrades the signal. The degradation compounds across channels. By the time a message traverses ten noisy channels, the accumulated entropy can make the output unrecognizable — even if each individual channel was "pretty good."

Shannon was describing telegraph lines and radio transmissions. But the math describes AI agents exactly. Every step in an agentic pipeline is a channel. Every LLM call introduces noise — not random noise, but probabilistic variation that looks reasonable, passes surface inspection, and accumulates silently across steps. The only solution Shannon identified was redundancy: error correction, checksums, verification. You have to add structure that isn't in the original signal.

For AI agents, that structure is the harness.

The Naive View is understandable. Human cognition is bad at compound probability. If someone told you each step succeeds 90% of the time, you'd probably estimate a ten-step pipeline succeeds around 90% of the time — maybe a bit less. The actual answer is 35%. Your intuition is off by more than a factor of two.

This intuition gap — the distance between "90% is pretty good" and "10 steps at 90% is a 35% system" — is where most AI projects go wrong. Not through bad intentions, lack of effort, or even bad engineering. The math just wasn't in the budget.


Three Layers of Reliability

The March of Nines requires three layers working together. Each layer addresses a different reliability ceiling.

Layer 1: Agent Skills (~90% reliability). Prompts, tool use, model intelligence. This is what most teams focus on. Better prompts, better models, more context. It works — up to a point. The point is approximately 90%. After that, the model's probabilistic nature means that roughly one in ten outputs will be wrong in some way. Not always catastrophically wrong. Sometimes subtly wrong — a classification that's one tier off, a summary that omits a key detail, a recommendation that's defensible but not optimal.

You can push skills past 90% with extraordinary prompt engineering. But you can't push them to 99% reliably, because the model's behavior is non-deterministic. The same input won't always produce the same output. You're playing a probability game, and probability has a ceiling.

The instinct is to solve this by adding more agents. If one agent gets it wrong, have another check its work. If that one gets it wrong, add a third. This feels rigorous. It isn't. You're stacking noisy channels, not adding error correction. Unless those reviewing agents run deterministic validation rules, you're compounding the probability problem — not solving it.

Layer 2: Harness Engineering (~99%+ reliability). Everything that must happen every time — codified in software, not trusted to the model. Linting after every file change. Type-checking before every commit. Deterministic validation rules comparing reasoning to output. Fixed execution sequences that don't vary based on the model's decisions. Git operations handled by scripts, not by the agent.

The harness doesn't make the agent smarter. It makes the system more reliable by removing the model from decisions it shouldn't make. If formatting must happen every time, don't prompt the model to format — run a formatter. If validation must happen before deployment, don't ask the model to validate — run a validation script. Every deterministic step you extract from the agent and codify in the harness pushes the system's reliability ceiling higher.
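As a sketch of that division of labor (the function names are illustrative placeholders, not the actual harness), the probabilistic call sits between deterministic steps that run every time:

```python
# Minimal harness sketch: the model handles the probabilistic step;
# formatting and validation are enforced in code, on every run.
# All names here are illustrative placeholders.

def call_model(task: str) -> str:
    """Stand-in for the probabilistic LLM step (the skill)."""
    return f"draft output for: {task}"

def format_output(text: str) -> str:
    """Deterministic formatting -- run a formatter, don't prompt for it."""
    return text.strip().capitalize()

def validate(text: str) -> None:
    """Deterministic validation -- fail loudly instead of shipping a quiet error."""
    if not text:
        raise ValueError("empty output")

def run_step(task: str) -> str:
    draft = call_model(task)       # probabilistic: the skill
    result = format_output(draft)  # deterministic: the harness
    validate(result)               # deterministic: the harness
    return result
```

The point of the structure is that `format_output` and `validate` cannot be skipped by a model decision; they are plumbing, not prompts.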

Layer 2 is where most teams have the biggest gap. They have skills. They have the beginning of evaluation. But the harness — the deterministic, always-on infrastructure that enforces correct behavior regardless of what the model decides — they build ad hoc, if at all. The result: a system that runs 90% reliable in the hands of the engineer who built it and 70% reliable in the hands of everyone else, because the harness lives in the engineer's head, not in the codebase.

Layer 3: Evaluation Architecture (catches the rest). Continuous monitoring that detects when the system drifts, degrades, or encounters conditions it wasn't designed for. The four-layer evaluation stack: progressive autonomy that routes decisions by confidence and stakes, deterministic validators that catch reasoning-output disconnects, an LLM-as-judge that samples outputs for quality, and factorial stress testing that exposes hidden biases before they reach users.

The evaluation layer doesn't prevent failures. It catches them — quickly enough that they don't reach users, or at least quickly enough that the damage is contained. Without it, you learn about failures from users, from support tickets, from the downstream systems that process the output. That feedback loop is too slow and too expensive for anything operating at scale.
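A minimal sketch of the first of those four layers, progressive autonomy routing, with made-up thresholds and labels:

```python
def route(confidence: float, stakes: str) -> str:
    """Progressive autonomy sketch: oversight scales with consequence.
    Thresholds and labels are illustrative, not prescriptive."""
    if stakes == "high":
        # High-stakes outputs always get a human; low confidence escalates.
        return "escalate" if confidence < 0.95 else "human_review"
    if confidence >= 0.90:
        return "auto_ship"
    return "human_review"
```

The routing itself is deterministic code; only the confidence score it consumes comes from the probabilistic layer.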


Why This Math Matters

The intuition that "90% is pretty good" is the most dangerous misconception in agent development.

90% accuracy on a single step is fine. A copywriting agent that produces good output nine times out of ten is useful — you review the output, fix the occasional miss, and move on. But the moment you chain steps together — and every real-world system chains steps — 90% compounds into something far worse than your intuition predicts.

Steps in pipeline   90% per step   99% per step   99.9% per step
3                   73%            97%            99.7%
5                   59%            95%            99.5%
10                  35%            90%            99.0%
20                  12%            82%            98.0%

A twenty-step pipeline at 90% per step succeeds 12% of the time. That's not a system — that's a lottery. And twenty steps isn't unusual for a real-world agent workflow: intake → research → classify → analyze → draft → validate → revise → format → approve → deliver, with several sub-steps at each stage.

Claude Shannon formalized this in 1948 with information theory. Every transmission through a noisy channel loses signal. The entropy compounds. The only way to maintain fidelity across a long channel is to add redundancy — error correction, checksums, verification. Shannon didn't know about AI agents, but his math describes exactly why they fail: each step is a noisy channel, and without redundancy (the harness), the signal degrades with every transition.


The Debugging Trap

Compound failures create a specific kind of debugging hell that teams don't anticipate until they're in it.

When a single-step system fails, you can usually reproduce the failure. Same input, same output. You adjust the prompt, run it again, verify the fix. The feedback loop is tight.

When a ten-step pipeline fails, reproduction is non-deterministic. The same input might succeed the next six runs and fail on the seventh. The failure might appear at step three on one run and step seven on another. The output that triggers the failure might be subtly different each time — plausible enough that the downstream step doesn't explicitly error, just produces slightly degraded output that degrades further as it travels.

This is the compound failure debugging trap: the failures are invisible until they've accumulated, non-reproducible when you try to isolate them, and located somewhere in the pipeline that isn't where the output degradation becomes visible. Teams in this trap do one of two things. They start adding manual checkpoints at every step — "let's add human review after step 4 and step 7" — which defeats the purpose of automation. Or they start hardening individual steps with more elaborate prompts, which addresses the symptom while missing the cause.

The cause is always the same: an agent making a decision that should be deterministic. A classification that should have a validation rule. A formatting step that should be a formatter. A branch point that should have a guardrail. Once you extract those decisions from the agent and codify them in the harness, the pipeline stabilizes.

The debugging trap is how teams discover they need Layer 2. They would have been better off designing it in from the start.


Why Teams Build Only Layer 1

If the three-layer architecture is so obviously necessary, why do almost all teams build only the first layer?

The answer is not ignorance. Most of the engineers who build single-layer AI systems are technically competent people who understand probability and system design. The answer is incentive structure and timeline mismatch.

Layer 1 produces immediate, visible results. You prompt an agent, it produces useful output, you show it to stakeholders, they're impressed. The feedback loop is short and positive. Layer 2 and Layer 3 don't produce immediate results — they prevent future problems. Preventing problems is invisible work. Nobody gets a standing ovation for the incident that didn't happen.

The demo culture amplifies this. AI capability is demonstrated through demos, and demos run short pipelines with curated inputs. The three-step demo at 90% produces a 73% success rate — high enough that the occasional failure can be explained away as a rough edge, not a fundamental architectural problem. The demo creates the impression that the system works. The team ships the three-step demo behavior as though it will generalize to ten-step production behavior. It doesn't.

There's also a sunk cost dynamic. By the time teams discover they need Layer 2 and Layer 3, they've built significant Layer 1 infrastructure. The skills are working, roughly. The agents are deployed, approximately. Rearchitecting around a harness means acknowledging that the original architecture was insufficient — which means acknowledging that the original investment produced a system that needs to be rebuilt, not just extended. That's a hard conversation. Teams often delay it, patching individual skills instead of extracting the deterministic decisions into the harness, until the accumulated failures make the rebuild unavoidable.

The VZYN story that follows isn't unusual. It's the common pattern. The unusual thing is rebuilding before the failures become catastrophic rather than after.


The VZYN Story

I built an AI-powered marketing intelligence platform called VZYN Labs. I didn't know the compound failure math yet — but I understood that agents needed specialization. So the first architecture was what I thought sophisticated AI product engineering looked like: fifteen specialized agents, each responsible for a domain — one for research, one for SEO, one for content strategy, one for analytics, one for competitive intelligence. They passed context between each other. They had complex coordination protocols. They felt capable.

They also composed into a failure factory.

The pipeline I built for client pre-audits ran approximately twelve steps end to end: research collection → market analysis → competitor mapping → keyword extraction → technical audit → performance analysis → content gap analysis → strategy synthesis → competitive summary → recommendations → formatting → report assembly. Twelve steps at ~85% per step — the agents were individually capable, well-prompted, and running on good models — produces an end-to-end success rate of around 14%.

I didn't know the math yet. I was experiencing it.

Two things disguised the failure rate. First, I was running short demos, not production pipelines. A five-step demo at 85% gives you 44% — bad, but the demos usually selected for successful runs. Second, when the pipeline failed, it often produced something — a plausible-looking output with a gap or an error embedded inside. Not a crash. A quiet failure.

I realized what should have been obvious the whole time: the agents were making decisions that should have been deterministic. Routing decisions: which agent handles this input? Made by an agent, probabilistically. Formatting decisions: how should this output be structured? Made by an agent, probabilistically. Validation decisions: is this output complete? Made by an agent, probabilistically. Every one of those was a noisy channel. Every one of those should have been code.

The rebuild collapsed fifteen agents into a unified architecture with a catalog of sixty skills and deterministic playbooks. The skills are the agent functions — the probabilistic steps that genuinely require model intelligence. The playbooks are the harness — fixed sequences that don't vary, validation rules that run deterministically, routing decisions made by code rather than by models. End-to-end reliability went from 14% to something approaching 90%.

The VZYN collapse is the March of Nines made visible. Not a story about bad engineering — the original architecture was thoughtful. A story about building the first layer without the second, precisely because I mistook capability for reliability. The compounding math turns capability into chaos at scale.


The Multi-Agent Trap

The most common response to compound failure problems is to add more agents.

If one agent makes mistakes, have a second agent review the output. If two agents disagree, have a third adjudicate. This pattern — the multi-agent review chain — feels like rigorous quality control. It's modeled on human quality assurance processes. It sounds reasonable.

It isn't reasonable. It's just more noisy channels.

Unless the reviewing agents use deterministic validation rules — rules that check specific, verifiable conditions rather than relying on the model's general capability — the review step is probabilistic. A second LLM reading the output of the first LLM and assessing its quality is running a soft evaluation, not a hard one. The second agent might agree with the first agent's wrong answer, because the reasoning the first agent provided sounds convincing. This is the anchoring bias failure mode: an agent whose output is post-processed by a second agent that was also shown the first agent's reasoning will tend toward agreement even when the reasoning is flawed.

The fix for compound failures is not more agents. It's fewer probabilistic steps. Every agent review step you add is another multiplication in the chain. Every deterministic rule you substitute for an agent decision is a multiplication you remove.

There's a deeper problem with multi-agent review architectures: they don't actually reduce risk on the failures that matter most. The failures that cascade through a multi-agent chain are the confident ones — the wrong answers that don't look wrong, the classifications that are plausible but incorrect, the outputs that pass the vibes check. The reviewing agents miss the same cases the first agent missed, for the same reason: the model's probabilistic reasoning produced a coherent but wrong answer.

Deterministic validation catches confident wrong answers precisely because it doesn't evaluate reasoning. It checks outputs against rules. Does this output contain the required fields? Does the classification fall within the allowed set? Does the referenced entity exist in the database? These checks don't care how convincing the model's reasoning was. They check whether the output satisfies the constraint. Confident wrong answers fail deterministic checks. They pass probabilistic review.
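Those three checks translate almost directly into code. A sketch, with hypothetical field names and allowed values:

```python
# Deterministic validation sketch. The rules never evaluate the model's
# reasoning -- only whether the output satisfies hard constraints.
# All field names, tiers, and entities below are hypothetical.

ALLOWED_TIERS = {"tier_1", "tier_2", "tier_3", "tier_4"}
KNOWN_ENTITIES = {"acct_001", "acct_002"}  # stand-in for a database lookup
REQUIRED_FIELDS = ("classification", "entity_id", "summary")

def deterministic_checks(output: dict) -> list[str]:
    """Return the list of constraint violations; empty means the output passes."""
    errors = []
    for field in REQUIRED_FIELDS:
        if field not in output:
            errors.append(f"missing field: {field}")
    if output.get("classification") not in ALLOWED_TIERS:
        errors.append("classification outside allowed set")
    if output.get("entity_id") not in KNOWN_ENTITIES:
        errors.append("referenced entity does not exist")
    return errors
```

A confident, fluent, wrong output fails these checks exactly as fast as an obviously broken one.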

This is why the harness, not the review chain, is the solution to compound failure. The harness extracts the deterministic decisions from the probabilistic layer and enforces them in code. The review chain adds more probabilistic layers on top. One of these approaches gets you to 99%. The other keeps you at 85% with more complexity.


The Gap Between Demo and Production

This is why demos impress and production disappoints. A demo runs a short pipeline, often with curated inputs, in a controlled environment. Three steps at 90% gives you 73% — good enough for a meeting room. The audience sees the successful output and extrapolates: imagine this at scale.

But scale means more steps, more varied inputs, more edge cases, and more time for drift. Production means running the pipeline hundreds of times with real data from real users who phrase things in ways your curated demo inputs never did. Production means the model provider updating their system overnight, shifting your 90% to 85% without warning. Production means a context window that fills up on the twelfth run, degrading performance in ways you can't reproduce consistently.

The gap between demo and production isn't a quality problem. It's a math problem. The math says: you need all three layers.

There's another gap that doesn't show up in demos: the interpretation gap. An agent that's 90% reliable on well-formed inputs might drop to 70% on edge cases — the ambiguous requests, the inputs that don't match the patterns the prompts were designed around. In demos, you don't see edge cases. In production, you see nothing but edge cases. Real users ask things you didn't anticipate, in phrasings you didn't test, with context you didn't account for. The harness handles edge cases the same way it handles everything else: deterministically. The model's reliability degrades on edge cases. The harness doesn't.


What Reliability Actually Means

I spent weeks chasing reliability before I had a clean definition of it — and realized I'd been measuring the wrong thing the whole time.

The word "reliable" carries different weight in different contexts. A reliable car starts most mornings. A reliable surgeon operates correctly every time. The word is the same; the standard is not.

In software, reliability has a technical definition inherited from telecommunications: the probability that a system performs its required function under stated conditions for a specified period of time. Reliability is not correctness on any given run. It's the predicted behavior across many runs, in varied conditions, over time.

AI agents introduce a specific reliability challenge that traditional software didn't have: their failure modes are not deterministic. Traditional software fails in predictable ways — a null pointer exception, a timeout, an unhandled edge case. The failure reproduces reliably from the same inputs. You find it, fix it, and it's gone.

AI agents fail probabilistically. The same input that produces a correct output on nine runs produces an incorrect output on the tenth — and the tenth failure might not reproduce when you investigate it. The model's non-determinism means you're not looking for a bug; you're characterizing a distribution. "This step fails roughly 10% of the time" is the correct description, and it means you need systems that handle 10% failure rates gracefully rather than systems that prevent the failure from occurring.

This is a different design challenge than traditional software reliability. Traditional reliability engineering asks: how do we prevent failures? Agent reliability engineering asks: how do we contain failures? The harness contains failures by catching them before they propagate. The evaluation layer contains failures by detecting them before they reach users. The trust tier system contains failures by ensuring the human oversight is proportional to the consequence of a miss.
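Containment can be sketched as a small wrapper: retry a probabilistic step against deterministic validation, and escalate when retries run out. A toy illustration, not a production pattern:

```python
from typing import Callable

ESCALATE = "ESCALATE_TO_HUMAN"

def contain(step: Callable[[], str],
            is_valid: Callable[[str], bool],
            max_retries: int = 2) -> str:
    """Containment, not prevention: a ~10% per-run failure rate is
    survivable if each failure is caught and retried or handed off."""
    for _ in range(max_retries + 1):
        output = step()
        if is_valid(output):
            return output
    return ESCALATE  # designed behavior, not a failure state
```

With a 10% per-attempt failure rate, two retries shrink the chance of escalating to 0.1%³ — the distribution is managed rather than eliminated.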

What reliability means, in practice, for a three-layer agent system:

Reliable specification: The behavioral contract is precise enough that you can predict, before deployment, what the agent will do in the situations it will encounter. Not all situations — that's impossible. But the common situations, the edge cases you can anticipate, and the boundaries of the situations where you can't predict behavior (which become the escalation triggers).

Reliable harness: The deterministic infrastructure behaves identically on every run. No variance. No "usually runs validation." Validation always runs. The harness's reliability is the ceiling for the system's reliability — if the harness is flaky, everything built on it is flaky.

Reliable evaluation: The detection layer has documented coverage. You know what it catches and what it doesn't. You know the false positive rate — how often it flags correct outputs as failures. You know the false negative rate — how often it misses incorrect outputs. These numbers degrade over time as the system encounters new patterns, and managing that degradation is part of operations.

Reliable escalation: When the system can't handle a case, it hands off cleanly. The human receives the context they need to make the decision. The handoff is logged. The decision is recorded and potentially fed back into the system as a future training case. Escalation is not a failure state — it's a designed behavior that's part of the reliability architecture.

None of this ships with Layer 1 alone. Skills get you capability. The full three-layer stack gets you reliability. And in production, at scale, reliability is the only moat that holds.


The Nines Across Trust Tiers

The March of Nines doesn't demand the same reliability target from every system. What it demands scales with consequence.

A marketing copy generator that's 70% reliable on end-to-end runs is irritating — you review more outputs, fix more errors, spend more time. But the consequence of a bad output is a bad draft. A human reads it. The draft doesn't ship.

A prescription medication referral system that's 70% reliable creates three failures per ten patients. The consequence of a bad output may be a patient receiving the wrong information about a medication dosage, drug interaction, or contraindication. In a Tier 4 safety-critical system, the acceptable failure rate approaches zero — not because perfection is achievable, but because the harness and evaluation architecture must catch what the agent misses before it reaches a patient.

Trust tiers aren't a labeling system — they're a reliability requirement. Tier 1 (low stakes, easily reversible) can operate closer to Layer 1. Tier 4 (safety-critical, irreversible) requires all three layers, and requires each layer to be deeply engineered rather than loosely assembled.

Trust tier   Example                              Acceptable end-to-end failure rate   Layers required
Tier 1       Marketing copy                       < 30%                                Skills only
Tier 2       Customer service response            < 10%                                Skills + basic harness
Tier 3       Legal document draft                 < 2%                                 All three layers
Tier 4       Medical referral, financial advice   < 0.1%                               All three, deeply engineered

The mistake most teams make is applying Tier 1 infrastructure to Tier 3 or Tier 4 problems. The demo worked. The system felt capable. The early runs looked good. Then the edge cases arrived, the volume scaled, and the compound failures accumulated into something they couldn't debug or contain.

Trust tiers make explicit what the math implies: Layer 2 and Layer 3 aren't overhead. They're risk-adjusted necessity.


Building the Layers in Order

There's a sequencing discipline to the three-layer system that most teams violate.

The temptation is to start with evaluation. Evaluation feels sophisticated. Monitoring dashboards, quality metrics, automated test suites — these are the visible artifacts of engineering rigor. Teams that want to demonstrate they're doing AI seriously often build evaluation infrastructure first.

This is the wrong order. Evaluation without a harness measures unreliable behavior accurately. You'll know exactly how often the system fails, without having the mechanism to prevent it. The evaluation layer is a detection system. Detection is useful only when paired with enforcement.

The right order is:

First: Build the skills. Design the agent capabilities that match your problem. Identify what each step needs to do, what inputs it requires, what outputs it produces. Get each step to ~90% reliability in isolation. Understand where the probabilistic boundaries are — what the model handles well, what it struggles with.

Second: Build the harness. Extract every deterministic decision from the agent layer. Codify them in software. Build the validation rules, the routing logic, the error recovery paths. Connect the skills to each other through deterministic plumbing, not through model-to-model hand-offs. The harness is complete when every step that must behave consistently does behave consistently, regardless of what the model decides.

Third: Build evaluation. Now you have something worth measuring. The harness enforces the baseline. Evaluation catches the cases where the baseline is insufficient — the novel inputs, the edge cases, the model drift. Because the harness is handling the deterministic cases, evaluation can focus on the hard cases: the behavioral scenarios that require judgment to assess, the quality dimensions that resist automated checking.

Teams that skip Layer 2 and go straight to Layer 3 find themselves running sophisticated evaluation on an unreliable system. They optimize the metrics that are easy to measure. They miss the failures that look like successes.


What a Three-Layer System Looks Like

When all three layers work together, the system feels qualitatively different from what most teams build.

The agent skills handle the genuinely ambiguous work — the classification decisions, the synthesis, the generation. The harness handles everything that should always happen the same way — validation, formatting, routing, error recovery. Evaluation runs continuously, sampling outputs against behavioral contracts, catching drift before it propagates.

The practical signature of a three-layer system is this: when something fails, you know why it failed, where it failed, and what to change. The evaluation layer caught it. The harness logged the failure point. The skill can be adjusted in isolation without disturbing the rest of the pipeline.

The practical signature of a one-layer system is the opposite: you know something failed because the output was wrong, but you don't know where it went wrong, the failure doesn't reproduce reliably, and any change you make might fix this failure while creating a new one somewhere else.

There's an operational difference too. In a three-layer system, the humans in the loop are doing high-value work: reviewing the cases the evaluation layer flagged, deciding whether a novel failure pattern represents a spec gap or a model gap, upgrading the harness when the escalation patterns reveal a new deterministic rule to codify. Their attention is directed by the system to where it's needed.

In a one-layer system, the humans in the loop are doing low-value work: reviewing every output to catch the failures the system can't catch for itself. Their attention is consumed by supervision rather than directed by detection. They're the Layer 2 and Layer 3 that the system never built.

There's a compounding advantage here that mirrors the compounding failure. Every deterministic rule you add to the harness is a failure mode that never reaches evaluation or human review. Every evaluation pattern you establish catches a class of failures automatically. The system becomes more reliable over time not despite growth but because of it — each new edge case encountered and handled correctly expands the harness and the evaluation coverage.

The compound failure math runs in reverse for teams that invest in the infrastructure. A team with 90% harness reliability (still imperfect, but systematically enforced) combined with an evaluation layer that catches 80% of the remaining failures produces end-to-end reliability far beyond what skills alone can reach. The layers compound in your favor. The nines accumulate.
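As a toy calculation using the assumed figures above (90% harness reliability, 80% evaluation catch rate), the layers multiply failure rates down:

```python
# Failures that reach users = (failures the harness lets through)
#                           x (fraction of those the evaluation layer misses)
harness_failure_rate = 0.10  # harness lets 10% of failures through
eval_miss_rate = 0.20        # evaluation misses 20% of those

residual = harness_failure_rate * eval_miss_rate
print(f"{residual:.1%} of runs fail in a way that reaches users")
```

The same multiplication that compounded against you across pipeline steps now compounds for you across containment layers.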

Skills expand capability. The harness enforces reliability. Evaluation catches drift. Miss any one, and the March of Nines catches up with you.


In the next chapter: the measurement systems that tell you whether you're making progress — and why the ones you inherited were built for a world that no longer exists.


Footnotes

  1. Yuval Noah Harari, Nexus: A Brief History of Information Networks from the Stone Age to AI (Random House, 2024).

  2. Claude E. Shannon, "A Mathematical Theory of Communication," Bell System Technical Journal 27 (July and October 1948): 379–423, 623–656.