Rob Pike wrote five rules of programming decades before AI agents existed.1 They apply with double force now.
Pike co-created Unix, co-invented UTF-8, and designed Go. His rules emerged from decades of building systems that outlasted their original context, worked in conditions their designers hadn't anticipated, and were maintained by people who weren't the original authors. They're not rules for clever code. They're rules for systems that actually work.
I built thirteen agents when one would do. The AI era is stacking up over-engineered, under-tested, premature-optimization traps — and I shipped one of them. Pike's rules are the antidote. Applied to agents, they surface exactly what most teams are getting wrong — and what to ship instead.
Rule 1: You Can't Tell Where a Program Is Going to Spend Its Time
Pike's original: Bottlenecks are surprising. Don't guess — measure. Profilers exist for a reason. Premature optimization based on intuition almost always targets the wrong thing.
Applied to agents: You can't tell where an agent pipeline is going to fail. Don't guess — measure. Most teams optimize their prompts based on intuition about which step is the bottleneck. The actual bottleneck is almost always somewhere else.
The failure point in an agent pipeline is rarely where it looks like it should be. A team building a multi-step document processing system spent three weeks optimizing the extraction prompt — the step they were most worried about — only to discover the actual failure rate was in the formatting step, which nobody had spent any time on. The extraction step was 94% accurate. The formatting step was 61%. The compound failure rate was driven entirely by the unexamined step.
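The compound arithmetic is worth making explicit: under the simplifying assumption that steps fail independently, per-step success rates multiply, so the weakest step dominates end-to-end reliability. A minimal sketch (the step names and rates mirror the anecdote above; they're illustrative, not from any real tooling):

```typescript
// Per-step success rates for a hypothetical document pipeline.
// Assuming steps fail independently, end-to-end reliability is
// the product of the per-step rates.
const steps: Record<string, number> = {
  extraction: 0.94, // the step the team spent three weeks on
  formatting: 0.61, // the step nobody examined
};

function pipelineSuccess(rates: number[]): number {
  return rates.reduce((acc, r) => acc * r, 1);
}

const endToEnd = pipelineSuccess(Object.values(steps));
console.log(endToEnd.toFixed(4)); // ~0.5734 — the 61% step drags everything down
```

Pushing extraction from 94% to 99% moves the end-to-end rate to roughly 60%; fixing formatting to 94% moves it to roughly 88%. The unexamined step is where the leverage is.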
This is Pike's Rule 1 made concrete. The team's intuition said "extraction is hard, optimize extraction." The measurement said "formatting is broken, fix formatting." The intuition was wrong. The measurement was right. The fix wasn't sharper intuition — it was instrumentation.
For agents, the profiler is the evaluation library. Four-layer evaluation — behavioral scenarios, deterministic validation, LLM-as-judge sampling, and factorial stress testing — is the measurement infrastructure. Without it, you're guessing. And your guesses will be wrong in exactly the ways Pike predicted.
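None of those layers needs to be elaborate to start paying off. The core measurement habit is just tallying observed failures by pipeline step so that optimization effort follows the data rather than intuition. A hedged sketch of that habit (the result shape and step names are illustrative assumptions, not the book's actual tooling):

```typescript
// One evaluation run of one scenario through the pipeline.
interface EvalResult {
  scenario: string;
  failedStep: string | null; // null = passed end-to-end
}

// Count failures per step so the dominant failure mode is visible.
function failureProfile(results: EvalResult[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const r of results) {
    if (r.failedStep !== null) {
      counts.set(r.failedStep, (counts.get(r.failedStep) ?? 0) + 1);
    }
  }
  return counts;
}

const results: EvalResult[] = [
  { scenario: "invoice-a", failedStep: "formatting" },
  { scenario: "invoice-b", failedStep: "formatting" },
  { scenario: "invoice-c", failedStep: "extraction" },
  { scenario: "invoice-d", failedStep: null },
];
console.log([...failureProfile(results).entries()]);
// the profile, not a hunch, tells you which step to optimize first
```

Even a table this crude ends the guessing: the step with the tallest bar gets the next week of prompt work.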
The teams that ship fastest in the AI era aren't the ones building the most sophisticated agents. They're the ones that measure first, then optimize. Sophistication comes after measurement points at the real bottleneck.
Rule 2: Measure. Don't Tune for Speed Until You've Measured
Pike's original: Even after measuring, only optimize if a single bottleneck dominates. Micro-optimizing spread-out code is wasted effort.
Applied to agents: Even after measuring, only optimize if a single failure mode dominates. Micro-optimizing spread-out agent prompts is wasted effort.
The evaluation architecture from Chapter 9 produces specific failure data: which scenarios failed, which stress variations produced the most instability, which reasoning-output disconnects appeared in the deterministic validation layer. This data reveals where optimization effort should go.
The mistake is optimizing everything simultaneously — tightening every prompt, reducing every latency, improving every validation rule. Spread-out optimization produces spread-out improvement: 2% better here, 3% better there, compounding to something real but slowly and expensively.
Concentrated optimization — finding the one failure mode that accounts for 40% of pipeline failures and eliminating it — produces step-change improvement. One targeted intervention changes the end-to-end reliability more than ten small ones.
Rule 2 also applies to the harness. Don't add validation rules for failure modes you've never observed. Add them when measurement reveals a real gap. A harness that grows by reaction to observed failures is tight and purposeful. A harness that grows by anticipating hypothetical failures accumulates rules that interact in unexpected ways and slow down sessions without preventing real problems.
Measure. Then optimize. In that order. Always.
Rule 3: Fancy Algorithms Are Slow When n Is Small, and n Is Usually Small
Pike's original: Simple algorithms with low constant factors beat clever O(log n) algorithms when your data set is tiny — which it usually is in practice.
Applied to agents: Simple architectures outperform clever ones when the problem scope is realistic, and the problem scope is almost always realistic.
This is the VZYN Labs lesson in one sentence. Thirteen orchestrated agents with hexagonal architecture — a fancy algorithm for a problem where n was small. One agent with fifty-seven skills — a simple stack that actually worked. The engineers built the fancy version because it was the "right" architecture for enterprise-grade marketing automation at scale. I realized later: we weren't at scale. We were shipping an unvalidated MVP into an unproven market.
The fancy algorithm had a constant factor that dominated the actual problem size: the overhead of coordinating thirteen agents, maintaining context across specializations, debugging failures that propagated across coordination boundaries. That overhead was higher than the work being done.
The VZYN rebuild eliminated the coordination overhead by eliminating the coordination. One agent. Fifty-seven skills routed through deterministic playbooks. The agent's job became simpler, and the system became dramatically more reliable, because the "fancy" part — multi-agent coordination — was the bottleneck. When we removed it, the remaining work was straightforward.
Rule 3 predicts exactly what happened: we built a sophisticated solution for a problem where a simple one would have worked better. The sophisticated solution had a higher constant factor. The simple solution, deployed earlier, would have validated the market assumption faster and cheaper. The fancy algorithm was expensive when n was small. And n was small.
The pattern repeats constantly in AI development. Teams build multi-agent orchestration frameworks before proving multiple agents are needed. They design dynamic routing layers before the routing patterns have been observed. They ship self-improving loops before the base behavior is reliable. Each one is a fancy algorithm deployed to a small n. Each one breaks, eventually, when the overhead exceeds the value.
Right-size to the problem. One agent when one suffices. Multiple agents when the domain genuinely requires specialization — when the complexity is intrinsic to the problem, not imposed by the architecture.
Rule 4: Fancy Algorithms Are Buggier Than Simple Ones
Pike's original: Complexity is a liability. A simple algorithm you can reason about and debug in ten minutes is almost always preferable to a brilliant one that takes days to get right.
Applied to agents: Complex agent architectures are harder to debug than simple ones. The bugs compound, and the bugs are the kind that look like successes.
Traditional software bugs produce errors: exceptions, null pointers, failed assertions. They're visible. They break things in obvious ways. An agent system's bugs are different: they produce plausible-looking wrong outputs. The system doesn't crash — it succeeds at the wrong thing. The formatting agent produces content that follows the format but subtly violates the brand guidelines. The classification agent categorizes correctly 94% of the time and miscategorizes the 6% that matter most.
These bugs are hard to find in simple architectures. In complex ones, they're nearly impossible. When a thirteen-agent pipeline produces a bad output, the investigation requires understanding what each agent received, what it produced, how it was interpreted by the next agent, and where in the chain the original correct intent diverged from the final wrong output. The debugging surface is proportional to the architecture's complexity.
The hexagonal architecture in the VZYN original build was structurally correct. Every layer respected its boundaries. Every adapter mapped its port correctly. The code was, in a traditional sense, well-engineered. And when something was wrong — when the marketing strategy agent produced a recommendation that was technically valid but commercially irrelevant — it was almost impossible to trace why. The wrong answer had come through layers of abstraction that each transformed the reasoning without recording what had changed.
Simple architectures fail visibly. The failure is in one place, caused by one thing, fixable by one change. You can hold the whole system in your head. You can follow the path from input to output without losing the thread.
Pike's Rule 4 applied to agents: prefer the architecture you can debug in an hour over the architecture that might scale to ten million users someday. Your first bottleneck is reliability, not scale. Reliability requires debuggability. Debuggability requires simplicity.
Rule 5: Data Structures, Not Algorithms
Pike's original: Choose the right data structure and the algorithm often becomes trivially simple. "Show me your data structures and I'll show you your code."
Applied to agents: Choose the right intent structures and the agent's behavior often becomes correct without elaborate prompting. "Show me your spec and I'll show you your agent's output."
This is the principle that connects the whole book. Pike was talking about data — the structure of information storage that makes algorithmic logic simple or complex. The same principle applies to intent: if you've chosen the right intent structures and organized your specification well, the agent's behavior will almost always be correct.
Intent structures are the spec, the intent contract, the CLAUDE.md, the hard boundaries. These are the "data structures" of agent development — the organized information that the agent's behavior is computed from. Get them right, and the behavior becomes predictable. Get them wrong, and no amount of prompt engineering will compensate.
The teams that invest heavily in prompt engineering and lightly in specification are building algorithms without data structures. They're tuning the computation without organizing the input. The prompts get increasingly baroque: longer, more qualified, full of special cases and exceptions that pile up as the model continues to produce wrong answers. The real problem is that the data structure — the spec — was wrong. No algorithmic sophistication can fix a data structure problem.
The corollary is just as important: if your spec is precise, your prompts can be simple. The agent has the information it needs to make correct decisions. It doesn't need elaborate instructions for every edge case because the edge cases are resolved in the spec. The prompt is the invocation. The spec is the substance.
Nate Jones, who helped me see this clearly, articulated the evolution from prompt engineering to context engineering to intent engineering. Each step moves closer to Pike's Rule 5 — each step is about organizing the information the agent works from, not crafting increasingly clever instructions. Intent engineering is Rule 5 for agents. Get the intent structures right and the agent's behavior writes itself.
Three Pillars, Two Failure Modes
Pike's rules for agents organize around three pillars:
Context (input): What the agent knows. The specification, the discovery document, the intent contract, the codebase. Getting context right is Pike's Rule 5 — the right data structures make the algorithm self-evident. The right context makes the agent's output predictable.
Agents don't absorb context through experience. They have what's in their context window. The quality of what's in that window is the ceiling on the quality of the output. An agent given a vague, incomplete, or stale context will produce outputs that are correspondingly vague, incomplete, or wrong. The investment is in the context — the documentation written for the agent, not for humans.
Constraints (guardrails): What the agent can't do. The harness, the tool permissions, the branch isolation, the validation loops. Constraints aren't limitations — they're the boundaries that make creative freedom productive. An agent without constraints is an agent that reformats your production database because it seemed helpful.
Pike's linting insight applies here: linting rules are machine-enforceable specifications. Every convention that can be enforced deterministically should be enforced deterministically. Don't put it in the prompt ("always use TypeScript with strict mode") if you can put it in a lint rule that fails the build. The lint rule is enforced on every run. The prompt instruction is followed most of the time.
Coordination (execution): How agents work together. Simple queues beat fancy protocols. A fixed pipeline with gated transitions beats a dynamic orchestration framework that adapts on the fly. Pike's Rule 3 again: the fancy coordination was slower, harder to debug, and more fragile than the simple version.
The two human failure modes that Pike's rules catch:
Premature optimization. Building the scalable, enterprise-grade, horizontally-distributed version before you know if anyone wants it. The VZYN Labs engineers optimized for scale on an unvalidated concept — that was me signing off on it. The simple rebuild validated faster than the complex original ever could.
The premature optimization failure mode is particularly dangerous in the AI era because the tools make complex architecture accessible. You can spin up a multi-agent system in an afternoon. The question isn't whether you can — it's whether you should. The answer is almost always: not yet. Build the simple version first. Measure it. Optimize what measurement says needs optimization.
Specification fatigue. The inverse problem. The methodology works, so you apply it to everything at maximum depth. Every Tier 1 internal tool gets a thirty-page spec with factorial stress testing. The team burns out on documentation. Someone says "can we just skip the spec for this one?" And the methodology breaks.
Specification fatigue is real and underappreciated. Writing tight specifications is cognitively expensive work. It requires deep domain expertise, precise language, and the willingness to make explicit decisions about ambiguous situations. The people who do this well — who can write a behavioral contract that an agent can implement without asking clarifying questions — are rare and they get tired.
The defense against both is the same: right-size to the problem. Tier 1 gets minimal spec. Tier 4 gets maximum spec. One agent when one suffices. Multiple agents when the domain genuinely requires specialization. Measure the problem, then build the solution. Not the other way around.
Right-Sizing in Practice
The "right-size to the problem" principle is easy to state and hard to implement. What does it actually look like?
At Tier 1, a single agent with inline prompts and no formal harness is often sufficient. A script that formats data, a tool that drafts social media posts, a helper that summarizes meeting notes. The spec is a few paragraphs. The test is "does it do what I expect?" The harness is "I'll review it before using it." This is correct Tier 1 practice. A thirty-page spec for a CSV formatter is not rigor — it's specification fatigue.
At Tier 2, the harness begins in earnest. A formal CLAUDE.md. Linting enforced. A session log. Seven behavioral scenarios with two stress variations each. An intent contract for the main goal. This takes a few hours to set up properly. The overhead is real and proportional: a Tier 2 system that fails costs more than a Tier 1 system that fails. The few hours of setup pays back in reduced debugging time when something goes wrong.
At Tier 3, the spec depth increases substantially. Every edge case the domain expert can think of is encoded. The intent contract has a full cascade of specificity for each goal. Factorial stress testing runs before each deployment. Human review is in the loop for consequential outputs. The overhead is days of setup — and proportional to a system where failures have financial or legal consequences.
At Tier 4, the overhead is weeks. The spec covers every behavior the system will exhibit in normal and abnormal conditions. The intent contract is reviewed by domain experts. The evaluation library is built from real domain cases. Shadow mode runs before supervised mode. Every gate requires sign-off from multiple parties. For LNG operations or prescription medication referrals, that overhead is correct — the cost of a failure, measured in physical or legal harm, dwarfs any setup cost.
Right-sizing means resisting the pull in both directions. The pull toward over-engineering — applying Tier 4 rigor to a Tier 1 problem because "we should do this properly" — wastes time and produces specification fatigue that compromises future projects where the rigor is genuinely needed. The pull toward under-engineering — applying Tier 1 informality to a Tier 3 problem because "it's just an internal tool" — creates systems that fail in ways their builders didn't anticipate and their users didn't expect.
The tier question at intake exists precisely to calibrate the right-sizing. Answer it honestly, and the amount of rigor required at each subsequent phase becomes clear. Skip it, and every phase defaults to either too much or too little, based on the team's intuitions about what "feels right."
Specification Fatigue: The Real Bottleneck
Specification fatigue deserves more than a line item. It's the hidden bottleneck that limits how many projects a team can run with good methodology.
Writing a precise behavioral specification is hard. Not technically hard — cognitively hard. It requires holding the system's complete intended behavior in your head while translating it into precise language that leaves no room for misinterpretation. It requires making explicit decisions about every ambiguity, including the ambiguities you haven't noticed yet. It requires the domain expertise to know what matters and the language precision to capture it.
This kind of work has a daily cap. A good spec architect can produce maybe four to six hours of genuine specification quality per day before the precision starts to degrade. The decisions become less careful. The edge cases start getting resolved with "it should handle this appropriately" rather than explicit behavioral requirements. The spec gets longer but less rigorous.
Teams that run multiple concurrent Tier 3 and Tier 4 projects without the spec capacity to staff them properly end up with some projects at full rigor and others at apparent rigor — specifications that look thorough but were written with depleted attention, after the genuine focus was spent elsewhere. These are the projects that produce VZYN-style failures: the spec looked fine, the build followed it, and the system that emerged did something nobody intended.
The practical response to specification fatigue is structural: separate the roles. The spec architect who writes specifications should not also be the developer who implements them. If the same person does both — writing the spec in the morning and implementing it in the afternoon — their attention is split, and the spec quality suffers. The Dark Factory methodology makes this separation explicit: the spec architect, the agent (executor), and the validator are three distinct roles. The spec architect owns the spec. They're not also running the build.
For smaller teams where one person wears multiple hats, the response is sequencing: finish the spec before starting the build. Don't context-switch between specification and implementation within the same project. The spec must be complete — ready for a contractor to implement without asking questions — before the build phase begins. This discipline preserves specification quality even when headcount is limited.
The Discipline Behind Simplicity
Simplicity in agent systems is not easy. It's a discipline — one that requires active effort to maintain against constant pressure toward complexity.
The pressure toward complexity is structural. Every new use case suggests a new agent. Every new requirement suggests a new capability. Every new failure mode suggests a new safeguard. The architecture grows naturally, organically, with each addition locally justified. The result — a system that nobody designed to be complex but that became complex through accumulated justified additions — is exactly what Pike was warning against.
The discipline of simplicity requires asking, for every proposed addition: does this make the system more capable of serving the stated goal, or does it make the system more sophisticated for its own sake? These are different. A new agent that genuinely handles a use case no existing agent can handle is a capability addition. A new agent that handles a use case differently from how an existing agent handles it — because the new agent was trained differently, or has a different prompt, or uses a different tool — is complexity without capability gain.
The same discipline applies to specs. Every new requirement section should answer the question: does this describe a behavior that matters, or does this describe a behavior that feels thorough? Thoroughness and precision are different. A thorough spec that covers everything is a spec that covers some important things and many unimportant ones. A precise spec that covers what matters is a spec an agent can implement correctly.
Pike built simple systems that lasted decades. The Go programming language, which he co-designed, has remained simple while languages designed to be expressive have grown baroque. The simplicity wasn't an accident — it was actively defended against the pull of every feature request that seemed reasonable but would have added complexity faster than it added value.
That same active defense is what makes agent systems reliable over time. Not the absence of features, but the discipline to add only what the problem requires and nothing more.
The Simplicity Test
How do you know when a system is too complex for agent-assisted development?
The test has three questions:
1. Can an agent, given only the file structure and a task description, identify where to make the change? If the answer requires understanding two or more abstraction layers first, the architecture is too complex for reliable agent navigation. Not impossible — but the discovery document must compensate with explicit navigation maps, and the build sessions will be slower and more error-prone than they need to be.
2. Can a human read the spec and verify in ten minutes whether an agent's output is correct? If verification requires deep knowledge of the codebase, the testing approach, and the conventions — knowledge that isn't in the spec — then the spec is incomplete. The spec must encode enough context that correctness is verifiable by someone with the spec, not just someone who has lived with the codebase.
3. Can you explain the architecture to a new contributor in a twenty-minute conversation? This is Pike's Rule 4 applied as a social test. If the explanation requires an hour, the architecture is complex enough that bugs will be hard to find and fixes will be hard to make without introducing new problems.
Failing any of these tests isn't disqualifying. Some problems are genuinely complex. But failing them is a signal to compensate: a richer discovery document, more frequent build stalls with human intervention, smaller task decompositions, and more careful evaluation coverage. Complexity in the problem must be matched by rigor in the process.
Linting as Machine-Enforceable Specifications
One of the most practical applications of Pike's rules to agent development is the linting insight: lint rules are machine-enforceable specifications.
Every convention that can be enforced deterministically should be enforced deterministically. Not in the prompt. Not in the CLAUDE.md. In a lint rule that fails the build.
"Always use TypeScript with strict mode" is a convention. You can put it in the CLAUDE.md — the agent will follow it most of the time. Or you can put it in your TypeScript config and your ESLint rules — and the agent will follow it every time, because non-compliance breaks the build.
"Never use any except in documented exceptions" is a convention. It can be a CLAUDE.md instruction, followed probabilistically. Or it can be an ESLint rule (@typescript-eslint/no-explicit-any), enforced deterministically. The lint rule is always right. The CLAUDE.md instruction is right most of the time.
"Database queries must go through the query layer, not direct Supabase calls from components" is a convention that can be enforced architecturally — put the Supabase client only in the query layer — or through lint rules that flag direct imports of the Supabase client in component files. Either approach is more reliable than a CLAUDE.md instruction the agent will follow until it doesn't.
This principle has a corollary that most teams miss: when you find yourself writing a CLAUDE.md instruction because the agent keeps making the same mistake, the first question to ask is "can this be a lint rule?" If yes, make it a lint rule. Remove the CLAUDE.md instruction. Now the constraint is enforced at the harness layer, not the instruction layer. The harness is more reliable than the prompt.
The practical result is a CLAUDE.md that shrinks over time as conventions migrate from instructions to lint rules. The conventions that remain in CLAUDE.md are the ones that genuinely can't be mechanically enforced — domain knowledge, architecture decisions, project-specific context that requires human judgment to encode but machine enforcement to apply. Those should stay as instructions. Everything that can be a lint rule should be a lint rule.
Pike's insight was that data structures, not algorithms, determine code quality. The linting equivalent is: machine-enforced constraints, not agent instructions, determine convention reliability.
The Agent Environment as Product
There's one more Pike-derived principle that deserves its own space: the agent's environment is as much a product as the agent's output.
When developers think about AI agents, they think about prompts — what to tell the agent, how to phrase it, how to get better output. But Pike's Rule 5 suggests that the environment matters more than the instruction. A well-structured codebase produces better agent output than a perfectly crafted prompt in a messy codebase. A clean .factory/ directory with a current spec produces better results than an elaborate prompt with no context.
The environment is the product. Build the environment right, and the agent performs right. This is why the harness exists — not to constrain the agent's intelligence, but to create the conditions where that intelligence produces reliable results.
Documentation written for agents is different from documentation written for humans. Human documentation explains reasoning: "We chose this pattern because of X, and the tradeoff was Y." Agent documentation encodes behavior: "Files in this directory follow this naming convention. Changes to this module require this validation sequence. The hard boundaries are these, in priority order." The purpose is different. The format is different. The investment in writing it is the same — and it's real work that most teams don't do.
Pike's rules, applied here, compress into a single sentence: agents amplify whatever environment they operate in — good or bad. A well-designed environment with precise specs, clean constraints, and clear coordination produces disproportionately better agent output. A poorly designed environment with vague specs, implicit constraints, and tangled coordination produces proportionally worse output.
You get to choose. Build the environment deliberately, or let it form by accident. That choice decides whether agent speed is an accelerant for work or an accelerant for mistakes.
Simple environments. Simple coordination. Simple specifications at the right depth for the tier. Pike would approve.
Pike's Rules and the Bitter Lesson
Richard Sutton's Bitter Lesson argued that general methods that scale with computation always beat domain-specific clever methods, given enough time. Human intelligence encoded in systems doesn't compound. Raw computational power does.
Pike's rules are not in tension with this lesson. They're its complement.
The Bitter Lesson says: don't bet on human cleverness in the model. Models trained on more data with more compute will outperform models trained on clever human-designed features. Don't encode your intelligence into the model architecture — let the model learn it from scale.
Pike's rules say: don't bet on human cleverness in the system either. Simple architectures that scale with human discipline will outperform complex architectures that require brilliance to maintain. Don't encode your intelligence into the system architecture — let the spec and intent contracts encode it.
The combination is the design philosophy of the Dark Factory: lean models with rich specs and lean architectures with strong harnesses. Not clever agents. Not clever orchestration. Reliable rails that clear thinking agents can run on.
As models get stronger — and they will, at roughly the pace METR documents — the rails can simplify. The compensations in the CLAUDE.md become unnecessary as models improve. The elaborate multi-step validation chains become unnecessary as models become more reliable. The clever orchestration protocols become unnecessary as models become better at navigating simple coordination.
What doesn't simplify: the specs. The intent contracts. The hard boundaries. The evaluation libraries. The domain knowledge encoded in the cascade of specificity. These are the data structures Pike was talking about — the organized information that makes the algorithm trivially correct. Models getting smarter makes the algorithms better. Only human investment in the data structures makes the algorithms correct.
Simple architectures. Strong intent structures. Measurement before optimization. Let the data do the work.
That's what Pike would say. And he'd be right.
Next, we stop theorizing and look at the receipts.
Part III is complete. In Part IV, we look at real projects — what shipped, what broke, and what I'd do differently.
Footnotes
1. Rob Pike, "Notes on Programming in C" (1989). The five rules originated in this internal Bell Labs document. Reprinted in Brian W. Kernighan and Rob Pike, The Practice of Programming (Addison-Wesley, 1999). The formulation "Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident" is Rule 5.