I'm running four projects concurrently. Different domains. Different industries. Different countries. Different stakes. The same methodology.
The projects are at different stages. A few have shipped. A few are still in build. I don't know which of them will outlast the others. What I know is what the methodology looks like in practice — on real problems, with real constraints, for real organizations — and that's what this chapter shows.
I'll admit the obvious: this isn't a clean set of production case studies with three-year performance data. Some of these systems have now shipped and are running with real users; others are still mid-build. But the design decisions, the intake conversations, the spec constraints, the harness choices — those are real. They reveal something theory alone can't: what happens when the methodology meets the friction of specific organizations, specific domains, and specific humans with specific fears and specific knowledge.
Factory 1: Regasificadora del Pacífico — Tier 4, Safety-Critical Infrastructure
The operation: International LNG tankers arrive at Buenaventura Bay on Colombia's Pacific coast. Seventy-two cryogenic containers are loaded onto a barge. The barge moves them to land transport. Trucks carry them to the regasification plant in Buga. The gas feeds thermal power plants across southwestern Colombia. A $42 million operation. A five-year contract with Ecopetrol.
The intake: I asked the question. Oscar described the operation for fifteen minutes — the tankers, the cryogenic handling, the Ecopetrol relationship, the investor presentations to international capital. I asked: "What's the worst realistic outcome if the AI system gets it wrong?" He didn't need to think about it. Wrong guidance on LNG handling: physical harm. Wrong analysis in an investor presentation: financial catastrophe. Wrong compliance information in an Ecopetrol submission: voided contract, sector consequences.
Tier 4. Five minutes. The rest of the intake was about scope.
The three modules: The project covers three distinct deliverables under one Tier 4 classification.
Module 1 is an industrial idea validator — a system that takes a business concept (described in a voice note or document), launches research across technical, market, regulatory, and competitive dimensions, and produces an interactive dashboard answering: Is it viable? What does it cost? What are the risks? The first use case: evaluating a data center adjacent to the plant, exploiting the residual cold at -162°C. The harness here is a research pipeline with deterministic output structure — the dashboard format is fixed, the research categories are fixed, the validation rules check that every claim in the dashboard is sourced.
Module 2 is sales intelligence — website content and investor materials for thermal plant expansion across southwestern Colombia. Tier 4 by domain association, not by consequence: a wrong sentence in a sales deck isn't safety-critical, but the client's context keeps everything under the same umbrella. Human review is mandatory before any external-facing output ships.
Module 3 is the hardest: five operational and safety manuals synthesized from 227 source documents, with a sixty-day deadline, for the Ecopetrol contract. The harness requirement is absolute — every statement must be traceable to a source document, and no model-generated synthesis is permitted where direct quotation is possible. This is the opposite of RAG: not "find relevant chunks and synthesize," but "find the specific passage that says this, quote it, and cite it." Hallucination is not a quality concern here. It's a safety concern.
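The quote-and-cite rule described above can be sketched as a validation check. This is an illustrative sketch, not the production harness: the field names (`text`, `source_id`) and the two rules are assumptions made for the example, but they capture the constraint that a statement either resolves to a verbatim passage in a known source document or fails validation.

```python
# Hypothetical sketch of the Module 3 traceability rule: every statement in a
# generated manual must be a verbatim quote from a source document, with a
# citation that resolves to that document. All names here are illustrative.

def validate_manual(statements, sources):
    """Return the statements that fail the quote-and-cite rule."""
    failures = []
    for stmt in statements:
        doc = sources.get(stmt.get("source_id"))
        # Rule 1: the citation must resolve to a known source document.
        if doc is None:
            failures.append((stmt["text"], "unknown source"))
            continue
        # Rule 2: the text must appear verbatim in that document --
        # no model-generated paraphrase where direct quotation is possible.
        if stmt["text"] not in doc:
            failures.append((stmt["text"], "not a verbatim quote"))
    return failures

sources = {"DOC-014": "Cryogenic containers must be inspected before loading."}
ok = {"text": "Cryogenic containers must be inspected before loading.",
      "source_id": "DOC-014"}
bad = {"text": "Containers should probably be checked.", "source_id": "DOC-014"}
print(validate_manual([ok, bad], sources))
```

A paraphrase that a human reviewer might wave through fails mechanically here, which is the point: the harness refuses synthesis where quotation is possible, rather than trusting anyone to notice the difference.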
What the harness looks like: The communication chain is itself a harness. Oscar → Diego (AI Adoption Director) → Jhonn. One point of contact with the client. Deliverables flow Jhonn → Diego → Oscar. Diego validates technical output against domain reality before it reaches Oscar. Oscar validates strategic alignment before it reaches external stakeholders. Two human checkpoints — not in software, but in the organizational structure of the project — built into every deliverable path.
Update (2026-04-15): RDP is still in Phase 1. Module 3 — the Ecopetrol operational manuals — is the live workstream, still Tier 4, still safety-critical, still on the clock. Nothing about the tier has changed, and nothing about the discipline has slipped.
What we're watching: Source document quality will determine Module 3's ceiling. Two hundred and twenty-seven documents of varying quality, age, and internal consistency. The spec includes a document quality assessment phase before synthesis begins, but the sixty-day deadline creates pressure to compress it. We will not compress it. A synthesis built on contradictory source documents is worse than no synthesis.
Factory 2: Ecomm Knowledge Operating System — Tier 4, Patient Safety
The operation: A call center handles prescription medication referrals for a pharmacy. Customer service representatives receive calls from customers with questions about medications — dosing, interactions, contraindications, refill procedures. The representatives use a knowledge base of 500+ SOPs to find answers. The SOPs are structured consistently (trigger → steps → exceptions → escalation) but the search is keyword-based — representatives know it as a Ctrl+F system.
The intake: Less obvious than at Regasificadora, but Tier 4 by the same logic. A customer asks about taking ibuprofen with warfarin. The representative looks it up. The knowledge base surfaces a result. The representative follows it. If the knowledge base is wrong — if the AI system surfaces the wrong SOP, or misrepresents a drug interaction — the representative gives wrong clinical guidance to a patient.
One wrong answer about a drug interaction. One patient who follows the guidance. The realistic worst case is harm. Tier 4.
The key design decision: Every instinct in AI-assisted knowledge management points toward RAG — a vector database, semantic search, chunk retrieval, language model synthesis. The decision to use Postgres with pgvector — similarity search as a retrieval aid over structured SOP records, with no generative synthesis step — rather than a full RAG architecture was deliberate, with a specific rationale: structured data needs precision, not creativity.
The SOPs are already structured. They have defined sections. They have explicit trigger conditions. The classification that matters — "which SOP applies to this situation?" — is a matching problem, not a generation problem. A deterministic retrieval system that finds the right SOP reliably is better than a generative system that synthesizes an answer from multiple SOPs, because the synthesis introduces a generation step where the model can produce a plausible-but-wrong answer about a medication.
The harness enforces the architecture: the system surfaces the SOP, the representative reads the SOP, the representative makes the decision. The AI is a search and retrieval layer. It is not a decision-making layer. That constraint is written into the CLAUDE.md, into the product spec, and into every piece of customer-facing UI copy.
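The "matching problem, not generation problem" framing can be made concrete with a toy matcher. This is a sketch under stated assumptions — the trigger-keyword scoring and field names are invented for illustration, not the Ecomm design — but it shows the shape: retrieval is a deterministic function that surfaces whole SOPs, and no model ever composes an answer.

```python
# A minimal sketch of deterministic SOP matching: each SOP carries explicit
# trigger keywords, and retrieval ranks SOPs by keyword coverage. The scoring
# rule and fields are illustrative, not the production system.

def match_sops(query, sops):
    """Rank SOPs by how many of their trigger keywords appear in the query."""
    q = query.lower()
    scored = []
    for sop in sops:
        hits = sum(1 for kw in sop["triggers"] if kw in q)
        if hits:
            scored.append((hits, sop["id"]))
    # Deterministic order: best keyword coverage first, then stable id order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [sop_id for _, sop_id in scored]

sops = [
    {"id": "SOP-201", "triggers": ["ibuprofen", "warfarin", "interaction"]},
    {"id": "SOP-044", "triggers": ["refill", "procedure"]},
]
print(match_sops("Can I take ibuprofen with warfarin?", sops))  # ['SOP-201']
```

The same query always returns the same SOPs in the same order — the property that matters when a representative's next sentence to a patient depends on what the system surfaced.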
The brownfield reality: The wiki is the biggest asset and the biggest constraint. Fifteen years of accumulated SOPs, maintained by a QA team of two people working five hours per week. The SOPs are consistently structured, but some are outdated, some are inconsistent with current policy, and some have informal addenda that live outside the official SOP in the heads of experienced representatives.
Discovery came before specification. The discovery phase mapped every layer of the existing knowledge infrastructure: the official SOPs, the informal workarounds, the escalation pathways, the QA review process, the three teams (CSR, QC, Loyalty) and what each team knows that the others don't. The discovery document is what made it possible to write a spec that augments the existing system rather than replacing it.
Update: shipped (2026-04-15): The SOP rewriter side of the Ecomm KOS is in production. The standalone agentic tool that rewrites SOPs into a visual standard — Playwright scraping the wiki, Playwright automating the ERP for annotated screenshots, QC reviewing exported HTML — is running. That's the leading edge of the KOS build; the retrieval layer is next. Everything below about the Loyalty team, the escalation pathway, and the wiki-gap dynamic is still what we're watching, now with a production rewriter feeding the knowledge base.
What we're watching: The Loyalty team is a human patch for wiki gaps — they exist because the SOPs don't cover enough edge cases. A better search layer may actually expose the wiki gaps more visibly, not cover them. The system might make the Loyalty team more necessary, not less, in the short term. The spec accounts for this: the escalation pathway from AI-assisted search to human expert is a first-class feature, not an edge case.
Factory 3: Edifica — Tier 3-4, Legal Compliance
The operation: Propiedad horizontal is Colombia's legal framework for residential building governance under Ley 675 de 2001. Building administrators — administradores — manage financial transparency, assembly governance, maintenance coordination, and resident communication for buildings with 60+ units. Most of them manage three to eight buildings simultaneously, working from spreadsheets and WhatsApp.
The intake: The boundary between Tier 3 and Tier 4 required the most judgment of any project I've run. Nobody dies if the system makes a governance error. But the consequences are legal: a miscalculated quorum invalidates a property owners' assembly. A financial report that doesn't comply with Ley 675 transparency requirements exposes the administrator to regulatory action. A wrong convocatoria notification with the wrong deadline makes the assembly legally invalid.
These are financial and legal consequences, not safety consequences. Tier 3, with the specific modules that touch governance mechanics pushed to Tier 4.
The tier boundary in practice: Different modules carry different tiers within the same project. The resident directory and communication module is Tier 3 — errors are recoverable, consequences are operational. The assembly governance module (quorum calculation, convocatoria generation, acta validation) is Tier 4 — errors have legal consequences that can't be undone by fixing them after the fact. A legally invalid assembly that already voted on a major budget item cannot simply be re-done without significant legal process.
The CLAUDE.md hard boundaries encode this directly: "Never open an asamblea session without verified quorum. Never publish an acta without explicit human review and approval. Never allow an owner to be both physically present AND represented by poder in the same asamblea." These aren't soft guidelines. They're constraints the system refuses to violate.
The intent contract: The privacy-versus-transparency conflict is the central organizational tension in building management. Residents want their personal information private. Administrators need access to contact information, payment status, and ownership records to fulfill their legal obligations under Ley 675.
Without an intent contract, the system would decide this conflict inconsistently — favoring privacy in some contexts, transparency in others, based on model priors. The intent contract resolves it once: "When resident privacy conflicts with administrative transparency, transparency wins. The building's legal obligations under Ley 675 take precedence over individual preferences." Every ambiguous governance decision resolves the same way, across every session, without requiring the administrator to re-explain the trade-off.
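The "resolved once, every session" property can be shown with a deliberately tiny sketch. The lookup table and function names are hypothetical — the point is structural: the trade-off lives in one declared place, so every ambiguous decision resolves identically instead of drifting with model priors.

```python
# Illustrative sketch of an intent-contract lookup: the privacy-versus-
# transparency trade-off is decided once, in data, not per-request by a model.
# The table contents and names are hypothetical.

INTENT_CONTRACT = {
    # Ley 675 obligations take precedence over individual privacy preferences.
    ("privacy", "transparency"): "transparency",
}

def resolve_conflict(a, b):
    """Resolve a named value conflict the same way in every session."""
    key = tuple(sorted((a, b)))  # order of arguments must not matter
    return INTENT_CONTRACT.get(key)

print(resolve_conflict("transparency", "privacy"))  # 'transparency'
```

Trivial as code, but that triviality is the argument: a one-line table entry replaces an unbounded number of inconsistent, context-dependent model judgments.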
Update: shipped (2026-04-15): Edifica is production-ready and deployed. I realized somewhere in the build that the technology risk had never been the real risk — the adoption risk always was. From here out, every change is brownfield: delta specs against a live system with real administrators, not greenfield design against a whiteboard.
What we're watching: The sales cycle for building management software in Colombia comes down to demonstrating to administradores juggling three buildings simultaneously that the system actually saves them time. Hernan (the mid-level developer building Edifica with me) has the right instinct: the first administrator who runs a full assembly cycle — convocatoria generation through quorum verification through acta production — without a compliance error is the first testimonial. That's the thing we're building toward.
Factory 3.5: Declara IA — Tier 3, Tax Filing (A Note on Speed)
One project not in the original four deserves mention because it illustrates something the others don't: what the methodology looks like when the builder has domain expertise and technical help, and the problem is genuinely bounded.
Joen Anaya Ortega is building Declara IA — a Colombian tax filing web application for salaried workers. The problem it solves: DIAN (Colombia's tax authority) requires workers earning above a threshold to file a Formulario 220 and pay or receive a balance via Formulario 221. The calculation is deterministic — it's math specified by regulation — but the process is opaque enough that most workers either hire a contador or don't file at all.
The intake for Declara IA took about twenty minutes. Worst realistic outcome: a wrong tax calculation produces a wrong Formulario 221. The user files it. They either underpay (DIAN will eventually collect, with penalties) or overpay (they leave money on the table). Financial consequence, not safety consequence. Tier 3.
The spec was faster than any other project in this chapter — because Joen has the domain knowledge and the technical background simultaneously. He doesn't need an interpreter between the domain and the specification. The behavioral contract for the core calculation module was written in one session: PDF input → Gemini table extraction → deterministic calculation pipeline → Formulario 221 output. The deterministic step is the key insight: the tax calculation itself is a formula specified by DIAN. It doesn't require a language model. The language model handles the messy parts (extracting data from PDFs that don't follow a consistent format), and the deterministic pipeline handles the calculation. Confidence comes from the determinism, not the model.
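The pipeline shape — fuzzy extraction feeding deterministic arithmetic — can be sketched as follows. Everything specific here is a placeholder: the extraction is mocked (the real step uses Gemini on inconsistent PDFs), and the flat rate is invented for illustration, not DIAN's actual schedule. What the sketch preserves is the boundary: the language model never touches the arithmetic.

```python
# Sketch of the Declara IA pipeline shape. The extraction is a stand-in for
# the Gemini step, and the 10% rate is a PLACEHOLDER, not the DIAN formula.
# The point is the structure: messy input -> structured figures -> pure math.

def extract_income(pdf_text):
    """Stand-in for the extraction step: pull labeled figures from messy text."""
    figures = {}
    for line in pdf_text.splitlines():
        if ":" in line:
            label, value = line.split(":", 1)
            figures[label.strip().lower()] = float(value.replace(",", "").strip())
    return figures

def compute_balance(figures, rate=0.10):
    """Deterministic step: pure arithmetic over extracted figures.
    `rate` is an illustrative flat rate, not the DIAN schedule."""
    taxable = figures["gross income"] - figures["deductions"]
    tax_due = taxable * rate
    return tax_due - figures["withholdings"]  # positive = pay, negative = refund

doc = "Gross income: 120,000,000\nDeductions: 20,000,000\nWithholdings: 9,000,000"
print(compute_balance(extract_income(doc)))  # positive: a balance to pay
```

If extraction fails, the pipeline fails loudly at the boundary (a missing key, an unparseable number) rather than letting a model guess a figure — which is where the confidence-from-determinism claim actually lives.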
The reason this project moves faster than the others isn't just domain expertise. It's problem clarity. The output is a specific number on a specific form governed by a specific regulation. The spec can be precise because the requirements are precise. The evaluation is straightforward because the correct answer is mathematically verifiable. Tier 3, bounded problem, domain-expert builder — this is the profile that produces the fastest reliable systems.
The lesson: the methodology's speed isn't fixed. It scales with problem clarity and domain expertise. The harder the problem, the longer the spec phase. The more novel the domain, the longer the discovery. Declara IA is fast because Joen already knows what needs to be built.
Factory 4: VZYN Labs — Tier 2, Marketing Automation
The operation: VZYN Labs is an AI-powered marketing intelligence platform for digital agencies. The pre-audit playbook: an automated research and analysis sequence that profiles a prospect's market position — competitor analysis, keyword gaps, technical audit, performance benchmarking — and produces a report the agency uses to open new business conversations.
Why this one is here: The three Tier 3 and Tier 4 projects above create an impression that the methodology is only for high-stakes, safety-critical work. VZYN Labs is here to correct that impression. Tier 2 applications — marketing, content, analytics, data processing — are where most teams start and where the methodology is fastest and least expensive to apply.
The intake: If the agent generates a bad competitor analysis or misreads a performance benchmark, a human reviews it and fixes it before it reaches the client. The worst realistic outcome is an embarrassing report and a wasted hour. Tier 2. Three minutes.
What Tier 2 looks like in practice: The spec is twelve pages instead of thirty. The test scenarios are seven with two stress variations instead of seven with five. The intent contract covers the pre-audit goal (demonstrate value, open a follow-up conversation) with a focused cascade. The CLAUDE.md has fewer hard boundaries — because fewer things are irreversible. The build sessions run faster because the validation overhead is lighter.
The VZYN Labs rebuild is the canonical story in this book precisely because it's a Tier 2 failure. The original thirteen-agent architecture was over-engineered for a Tier 2 problem. The rebuild — one agent, fifty-seven skills, deterministic playbooks — was right-sized to a Tier 2 problem. The methodology's value at Tier 2 isn't preventing catastrophes. It's preventing the slow, expensive failure of building something more complex than the problem requires.
Update (2026-04-15): VZYN stays here as the cautionary tale. The thirteen-agent collapse is the lesson, and the lesson doesn't move.
Update: shipped, under a new name (2026-04-15): The rebuild lives as Vision Labs. Same methodology, different shingle — single agent, skill catalog, the Brain as the research substrate, and Mia as the customer-portal agent. Stack: Next.js 16 + Supabase + Anthropic Claude via the Vercel AI SDK 6. Josh at Suncoast Interactive is the first paying customer; his agency is using it in live operations. I'll admit the rebuild took longer than it should have — the spec went through two more passes after the pivot, and the portal copy alone ate a week I didn't budget — but it shipped. Feedback is pending from Josh's team; the next move depends on what they hit first. What matters for this chapter: the methodology that came out of VZYN's Tier 2 failure is the one running in production right now, under a different name, against a real retainer. The cautionary tale and the working system are the same story told in two beats.
What we're watching: Josh's operators are the signal. If Vision Labs shortens the pre-audit loop and the retainer deliverables enough to change how Suncoast sells, the Tier 2 rigor paid for itself. If it doesn't, the investigation is whether skill coverage, playbook fit, or the Mia interaction is the bottleneck — and at Tier 2 we can learn that fast, patch fast, and iterate without dragging a safety-critical harness behind us.
What the Clients Experience
From the client's perspective, the methodology produces something that matters as much as reliable software: trust in the process.
Oscar Isaza at Regasificadora didn't need to understand vector databases or trust tiers to make an informed decision about the AI system his executives would rely on. He needed to understand the answers to three questions: What did you build, exactly? How do you know it works? What happens when it's wrong?
The spec answered the first question. The test results answered the second. The certification and oversight structure answered the third. "What happens when it's wrong" — at Tier 4, a domain expert catches it before it reaches anyone who could act on it, and the incident feeds back into the evaluation library so the same failure is caught automatically in the future. That's a complete answer. Most AI vendors can't give a complete answer to the third question.
Clients in regulated industries, or clients with legitimate institutional risk concerns, have been burned by confident claims about AI accuracy that turned out to be accurate in controlled conditions and inaccurate in production. The methodology's documentation artifacts — the project card, the spec, the test results, the certification report — are evidence that the system was built deliberately, that its behavior was specified, and that its performance was verified. That evidence builds trust in a way that a demo doesn't.
Diego at Regasificadora was already using four different AI tools without integration, producing artisanal outputs per deliverable. The methodology's value proposition for him was not "AI is now accurate enough to trust." It was "here is the process by which we make AI outputs trustworthy enough to sign our name to." The distinction matters. The first promise is about model capability. The second is about governance infrastructure. One depreciates as models improve. The other appreciates.
The Hidden Difficulty
Every one of these projects encountered a version of the same difficulty, in different forms.
At Regasificadora, it was source document quality. The 227 documents that Module 3 depends on are of variable quality, age, and internal consistency. The discipline to assess document quality before beginning synthesis — and to refuse to synthesize from inconsistent sources — runs against every client expectation about what "fast" means. Clients who commission AI systems expect speed. The methodology requires that the speed begin after the foundation is solid.
At Ecomm, it was the discovery of what was actually in the knowledge base. The wiki had 500+ SOPs, but the discovery phase revealed that approximately 15-20% of them had informal addenda — updates known to experienced representatives but not written into the official document. These addenda lived in the heads of Loyalty team members, who existed precisely because the wiki didn't cover them. A search system built on the official SOPs would have been less complete than the informal knowledge it was intended to replace.
At Edifica, it was the Colombian legal system. Ley 675 de 2001 is the framework, but the implementation varies by municipality, by building type, and by local interpretation. The spec architect (me) is not a Colombian lawyer. The hard boundaries in the CLAUDE.md are derived from the law, but the interpretation of the law in edge cases requires domain expertise the system doesn't have. The boundary between what the system can resolve and what it must escalate is the spec's most important design decision — and it's a legal judgment, not a technical one.
At VZYN, it was the gap between what the pre-audit produces and what a prospect actually needs to see before engaging. A technically accurate competitive analysis isn't necessarily the analysis that converts a prospect. The system produces the truth about their competitive position. The sales conversation requires framing that truth in a way that motivates action. That framing is a human skill that the specification didn't capture and that the system can't replace.
These difficulties aren't failures of the methodology. They're examples of what the methodology is for: making the hard decisions explicit early enough to address them deliberately, rather than discovering them in production when the cost is higher.
The Timing Question
Why write a chapter about projects that are still in motion?
Not because nothing has shipped. As of April 2026, Edifica is deployed, the Ecomm SOP rewriter is running in production, Vision Labs (the VZYN rebuild) is in production with Josh at Suncoast Interactive as the first paying customer, and Nirbound CRM (formerly SonIA) is in production with its first paying customer — already brownfield, already running new enhancements out of a spec-vs-implementation audit. Attacca Claw Desktop ran its discovery session and has new specs in flight with brownfield fixes pending deploy. RDP is still in Phase 1.
I realized writing this chapter that the honest frame isn't "nothing is in production." It's: not enough production time has elapsed to generate three-year performance data for any of these. Regasificadora's operational manuals are going to Ecopetrol on a sixty-day clock — the lessons from how they perform will arrive after this book is in readers' hands. Edifica just started finding real administrators. The Ecomm retrieval layer is still ahead, even as the SOP rewriter is already running.
The lessons that matter for this chapter aren't the final outcomes. They're the design decisions, and those are available now.
What tier was set and why. What the CLAUDE.md's hard boundaries encode and why. What the spec's most difficult sections were. What the discovery phase revealed that the team didn't expect. What the intent contract's central conflict was. These are the things a practitioner building a similar system needs to know — and they're available from the design phase, not the production phase.
The four projects in this chapter are not finished success stories. A few have shipped and are now brownfield — meaning the methodology's next job on them isn't spec creation, it's spec maintenance under the weight of a live system. The others are still mid-build. Either way, the decisions made during design were made using the methodology described in this book, with the reasoning documented in the specs and project cards that guided them.
That's the honest version of case studies for a methodology that's being applied to projects that are just getting started. And honest is more useful than polished.
What All Four Have in Common
Four projects. Four industries. Four countries. Tiers ranging from 2 to 4. The same three-question intake. The same eight-phase pipeline. The same harness principles. The same evaluation architecture.
What changes is the overhead. Regasificadora has domain experts reviewing every deliverable. Ecomm has a pharmacist in the oversight chain. Edifica has a Colombian lawyer reviewing the governance modules. VZYN has a human reviewing the pre-audit report before it reaches the client — and that's the full extent of the mandatory oversight.
The overhead is not arbitrary bureaucracy. It's calibrated consequence. At Tier 4, the consequence of a wrong answer justifies the cost of multiple expert review gates. At Tier 2, the consequence of a wrong answer doesn't justify that cost — and applying Tier 4 overhead to Tier 2 work is not rigor. It's specification fatigue that slows the work without proportionate benefit.
The lesson from running four concurrent projects is that the methodology is not a template to apply uniformly. It's a framework to calibrate intelligently. The calibration is the tier. Get the tier right, and everything else follows at the right depth. Get the tier wrong — too low on a high-stakes system, or too high on a low-stakes one — and the methodology works against you.
Four factories. The engine is the same. The car is different. That's the point.
What These Projects Are Teaching the Methodology
Running real projects against the methodology reveals gaps that theory doesn't anticipate. Here are the ones showing up across the four factories.
The domain expert availability problem is systematic, not project-specific. Every Tier 3 and Tier 4 project depends on domain experts to validate outputs. Those experts are the busiest people in the organization. They're busy because they're the people who understand the domain well enough to validate. The methodology's timelines assume that domain experts are available for the reviews and sign-offs it requires. That assumption is wrong. The practical fix — building asynchronous review into the harness rather than synchronous sign-offs — is something the methodology needs to formalize.
The spec update discipline is harder to maintain than the spec creation discipline. Writing the initial spec is a contained effort with a clear end state. Updating the spec when the system changes is an ongoing obligation that competes with everything else. Every Tier 3 and Tier 4 project is already showing the first signs of spec drift — decisions made during build that weren't captured back into the spec document. The session log catches these in theory. In practice, the session log captures what was done, not always why the decision diverged from the spec. The next version of the methodology will formalize a spec patch protocol: any decision that deviates from the spec triggers an explicit update, not just a log entry.
The discovery phase is underspecified. Chapter 6 describes what to discover. These projects are revealing how to discover it — the specific questions that surface useful information versus the questions that produce comprehensive but non-actionable documentation. The Ecomm discovery uncovered the Loyalty team's informal knowledge precisely because the question was "what do experienced representatives know that isn't in the wiki?" That question isn't in the current discovery framework. It should be.
The intent contract becomes most important at the moments it wasn't written for. The Edifica intent contract resolved the privacy-versus-transparency conflict cleanly. But the governance module also encountered a conflict the intent contract didn't address: what happens when an administrator's instruction conflicts with a resident's legal rights under Ley 675? The spec handles specific scenarios. The intent contract handles general trade-offs. The gap between them — specific conflicts not anticipated in either document — requires a second-level intent: a conflict resolution process, not just a conflict resolution answer.
These aren't failures of the projects. They're the methodology learning from practice. The version of Dark Factory that runs these projects to completion will be more complete than the version described in this book — because the projects will have revealed what the book couldn't.
The final chapter asks the question every reader eventually arrives at: what do I do with this?