The Resistance Is Real
According to a 2024 Cisco study of 2,600 privacy and security professionals across the globe, more than one in four organizations — 27 percent — have banned generative AI tools entirely. Another 61 percent restrict which tools employees may use.1 Thirty percent of U.S. banks ban generative AI outright.2 McKinsey found that 75 percent of executives consider AI strategically critical, while fewer than 25 percent have moved from pilots to production.3
The prohibition is not a Colombian posture. It is the dominant enterprise posture globally.
Within twenty days of Samsung permitting employees to use ChatGPT, three separate incidents occurred. Engineers pasted semiconductor database source code, equipment defect detection code, and internal meeting minutes into the tool. Samsung's security team concluded that the IP had permanently left the corporate perimeter — ChatGPT retained user inputs for model training, and there was no mechanism to retrieve or delete what had been submitted. Samsung banned generative AI on all company devices the following month. Apple restricted ChatGPT and GitHub Copilot shortly after. Goldman Sachs, JPMorgan Chase, Citibank, and Deutsche Bank — whose official framing was "protection against data leakage rather than a view on how useful the tool is" — all followed.4
Governments moved too. Italy became the first Western government to ban ChatGPT in March 2023, citing GDPR violations. In February 2025, Australia, Taiwan, and South Korea all banned DeepSeek from government devices within days of each other. The United States followed with agency-level restrictions and proposed legislation.5
The resistance is not irrational. A legal firm cannot risk client data flowing through a third-party model — a 2026 U.S. federal ruling established that AI tool use can waive attorney-client privilege when the platform's privacy policy allows data sharing with third parties.6 A bank cannot risk an agent touching transaction-processing code without a human oversight framework it can defend to the OCC or the FCA. An energy company cannot explain to its regulator how an AI modified the control system firmware if it cannot produce an audit trail for the modification.
The companies aren't wrong to be cautious. The problem is that "prohibit everything" is not a governance model — it's a way to avoid building one. The evidence: the same studies that report prohibition also find that 50 percent of employees use unapproved AI tools anyway. UpGuard puts the number above 80 percent — including 90 percent of security professionals.7 The executives most likely to enforce the prohibition are the same executives most likely to be violating it.
Prohibition does not stop AI from entering the codebase. It stops the organization from governing how it does.
In early 2026, Stripe published a case study showing their internal AI agents merging 1,300 pull requests per week across a codebase of hundreds of millions of lines of code.8 The reaction from most legacy companies was not "we should do this." It was "that would never work here."
They were not wrong about the gap. They were wrong about whether the gap could be closed.
This chapter covers discovery — the phase that makes brownfield work tractable. The context matters: discovery is not only a technical phase. For a legacy company considering its first AI-assisted change to an existing codebase, the discovery document is also the security argument — the artifact that lets a security team evaluate a bounded, governed change instead of imagining an unbounded AI loose in production. Once you see that second function, you write the document differently.
The most expensive bugs come from changing code you don't understand.
Not code that's badly written — code that's undocumented, untested, and encoding behavior nobody remembers shipping. A function that runs on a cron job nobody set up. A validation rule that exists because of a customer complaint three years ago. An API endpoint that another team depends on but isn't in any integration docs. This is the territory of brownfield development — and it's where most real-world work happens.
The discovery phase produces a behavioral snapshot: what the existing system actually does, not what the documentation says it does. This snapshot becomes the foundation for the brownfield spec — the three additional sections (existing behavior to preserve, behavioral changes, regression scenarios) that prevent new code from breaking old behavior.
The Six Layers
Discovery examines six layers of the existing system, scoped to the blast radius of the planned change. Not the entire codebase — only the parts that the change will touch or could affect.
Layer 1: Structure. Directory map, framework identification, dependencies, schema. The question is not "how big is this codebase?" but "where does the thing I'm changing live, and what lives next to it?" A change to an order processing module requires understanding the directory structure of order processing, not the entire platform. Note the framework version — old frameworks have behavioral quirks new frameworks don't, and the agent needs to know which conventions apply.
Layer 2: Data model. Models, relationships, validations, constraints, state machines within the blast radius. This layer often reveals the most invisible constraints — a column that looks nullable but is never null in practice because a legacy migration filled it. A status field with five possible values documented and three more that appear only in the production database. A foreign key relationship that cascades in a way that isn't obvious from the schema. Discovery reads the schema, then reads the data migration history, then checks for discrepancies.
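The schema-versus-data check described above can be sketched in a few lines. This is an illustrative sketch, not production tooling: the table and status names are hypothetical, and `observed_rows` stands in for the result of a `SELECT DISTINCT status FROM orders` query against the production database.

```ruby
# Layer 2 discrepancy check (hypothetical names): compare the status values
# the code and docs claim exist against the values actually in the data.
DOCUMENTED_STATUSES = %w[draft open paid failed refunded].freeze

def undocumented_statuses(observed_rows)
  # observed_rows stands in for `SELECT DISTINCT status FROM orders`
  observed_rows.map { |row| row[:status] }.uniq - DOCUMENTED_STATUSES
end
```

Every value this returns is a candidate tribal knowledge gap: a state the system can be in that no document admits to.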
Layer 3: Behavioral contracts. For each endpoint or action in the blast radius: what triggers it, what it does, and what side effects it produces. This is the most important layer — it tells you what the system actually does as opposed to what it's supposed to do. Document it as: trigger → action → side effects. "When a payment is marked failed (trigger), the order status updates to pending-review (action), a webhook fires to the notification service, and a compliance log entry is created (side effects)." The side effects are almost always the part that breaks things.
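The payment-failure contract above can be made explicit in code. The sketch below is illustrative — the method and effect names are hypothetical, and the real side effects (webhook, compliance write, queued job) are represented as recorded entries rather than live calls — but the structure mirrors the documentation format: trigger is the caller's event, the action is the primary state change, and everything else is a side effect worth writing down.

```ruby
# Trigger: caller invokes this on a payment-failed event.
SideEffect = Struct.new(:name, :detail)

def process_payment_failure(order)
  effects = []
  # Action: the primary state change
  order[:payment_status] = "failed"
  # Side effect 1: status machine transition
  order[:status] = "pending-review"
  effects << SideEffect.new(:status_transition, "pending-review")
  # Side effect 2: immutable compliance log entry (insert-only)
  effects << SideEffect.new(:compliance_log, "payment_failure")
  # Side effect 3: async notification (queued, retried on failure)
  effects << SideEffect.new(:notification_queued, "order_status_changed")
  effects
end
```

A regression scenario for this contract asserts the returned effect list, not just the final order state — which is exactly why the side effects are the part worth documenting.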
Layer 4: Integration boundaries. External systems, data flows, failure handling, authentication. Where does this system talk to the outside world? What happens when those connections fail? A brownfield system built over ten years has integrations that were added pragmatically, not systematically. The payment webhook fires to one endpoint; the audit log writes to another service; the email notification calls a third-party API with its own authentication mechanism. Map them. For each integration, note what happens when it's unavailable — does the system fail hard, fail silently, or queue the operation for retry?
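The three failure modes worth recording per integration — fail hard, fail silently, or queue for retry — can be sketched as one wrapper. The names here are hypothetical; in a real codebase each integration has its own call site, and the point of discovery is to classify which of these three behaviors each one exhibits.

```ruby
# Classify an integration call's failure behavior (hypothetical sketch).
def call_integration(mode, retry_queue, &op)
  op.call
rescue StandardError
  case mode
  when :fail_hard
    raise                  # e.g. a synchronous compliance write: the request fails
  when :fail_silent
    nil                    # the error is swallowed — the dangerous, common default
  when :queue_retry
    retry_queue << op      # e.g. notifications: park the operation for a retry worker
    nil
  end
end
```

Discovery's job is to find the `:fail_silent` cases, because those are the ones that break without anyone noticing.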
Layer 5: Test coverage. What's tested, what's not, what behaviors are asserted. The absence of tests is information — it tells you which behaviors are unprotected and most vulnerable to regression. Map the blast radius against the test coverage. A file with twenty tests tells you the behavior is known, specified, and verifiable. A file with zero tests tells you the behavior exists, was built by someone, and has never been formally verified since. Changes to untested code require regression scenarios before any modification — the discovery phase creates them, even if they don't exist yet.
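Creating a regression scenario for untested code usually means a characterization test: record what the code does today and pin that exact behavior, right or wrong, before any modification. A framework-free sketch, with `legacy_discount` as a hypothetical stand-in for an untested function found during discovery:

```ruby
# Behavior observed in production code, reason unknown (tribal knowledge):
# orders over 100 get 10% off, but the cap at 50 is undocumented.
def legacy_discount(total)
  return 0.0 unless total > 100
  [total * 0.10, 50.0].min
end

# The regression scenario pins today's observed outputs — including the
# undocumented cap, which must survive the change unless the spec says otherwise.
REGRESSION_SCENARIOS = {
  99.0   => 0.0,   # below threshold: no discount
  200.0  => 20.0,  # normal case: 10%
  1000.0 => 50.0,  # undocumented cap kicks in
}.freeze

def regression_pass?
  REGRESSION_SCENARIOS.all? { |input, expected| legacy_discount(input) == expected }
end
```

Note that the cap is pinned even though nobody knows why it exists — preserving it is the safe default until discovery confirms the intent.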
Layer 6: Conventions. Naming patterns, error handling approaches, code organization, import structures. These aren't documented but they govern how the codebase expects new code to behave. Violate the conventions and the codebase fights you — not with errors, but with inconsistency that accumulates until a reviewer catches it or a merge conflict reveals it. Conventions are extracted by reading ten files and finding the patterns. How are errors returned? Are they exceptions or result types? Are functions named by their action or their subject? Does the codebase prefer early returns or nested conditionals?
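Pattern-finding of this kind can be partially mechanized. A rough sketch (the patterns and style names are hypothetical, and a real extraction would read files from disk): count which error-handling style dominates a sample of sources, so new code matches the majority convention instead of guessing.

```ruby
# Count occurrences of two competing error-handling styles across
# a sample of source strings and return the dominant one.
def dominant_error_style(sources)
  counts = Hash.new(0)
  sources.each do |code|
    counts[:exceptions]   += code.scan(/\braise\b/).size
    counts[:result_types] += code.scan(/\bResult\.(ok|err)\b/).size
  end
  counts.max_by { |_, n| n }.first
end
```

Ten representative files are usually enough of a sample; the goal is the dominant pattern, not a census.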
After the six layers: identify tribal knowledge gaps — behaviors in code with no documentation, no tests, and no obvious reason for existing. A function that runs on a cron job nobody set up. A validation rule added after a customer complaint that appears in the code but nowhere in the requirements. An API endpoint that another team depends on but isn't in any integration docs.
These are the landmines. The tribal knowledge gap list is the most important output of the discovery phase — not the directory map, not the schema dump, but the list of things the agent cannot be trusted to touch safely. Every item on it is either documented before the change begins, or excluded from the blast radius. No exceptions.
The Discovery Document
The discovery phase produces one artifact: the discovery document. It is not a report and not a presentation. It is a structured input to the delta spec — the information the spec architect needs to define what the agent can change and what must survive unchanged.
A discovery document for a brownfield change looks like this:
## Discovery: Order Status Update Flow
### Blast Radius
Files in scope: `orders/processor.rb`, `orders/status_machine.rb`,
`webhooks/payment_handler.rb`, `lib/compliance/order_log.rb`
Schema: `orders` table (columns: id, status, payment_status, compliance_log_id)
### Behavioral Contracts
- `PaymentHandler#process_failure`:
Trigger: Stripe webhook, event=payment_intent.payment_failed
Action: Updates order.payment_status = "failed"
Side effects: (1) Fires OrderStatusMachine#transition → order.status = "pending-review"
(2) Creates ComplianceLog entry (immutable, insert-only)
(3) Queues NotificationJob (async, retries 3x on failure)
NOTE: ComplianceLog creation is NOT wrapped in the payment transaction.
If the transaction rolls back, the log entry persists. This is intentional.
### Integration Boundaries
- Stripe webhooks: inbound only, verified via webhook secret
- NotificationService: async queue, degrades gracefully (queued jobs retry)
- ComplianceLog: synchronous write, no fallback — if it fails, the request fails
### Test Coverage
- `payment_handler_spec.rb`: 14 tests, covers happy path + 3 error cases
- `status_machine_spec.rb`: 8 tests, covers all documented status transitions
- UNTESTED: ComplianceLog behavior on rollback (tribal knowledge gap)
### Tribal Knowledge Gaps
1. ComplianceLog intentionally persists on transaction rollback — no documentation,
confirmed by reading git blame (added 2021, no commit message context)
2. Order status "pending-manual-review" appears in database but not in status machine —
appears to be a legacy status from a previous system, never transitions out
### Conventions
- Error handling: raise exceptions, rescued at controller layer
- Status fields: lowercase hyphenated strings (e.g. "pending-review"), not enums (historical: Rails 4 constraint)
- All compliance operations go through `lib/compliance/` — never inline
This document becomes the foundation for the delta spec. The spec architect reads it and adds three things the greenfield spec doesn't have: the existing behavior to preserve, the behavioral changes, and the regression scenarios that verify the existing behavior survived.
The tribal knowledge gaps in the document become hard constraints in the spec: "must not modify ComplianceLog persistence behavior" becomes a hard boundary in the agent's CLAUDE.md — the behavior is now documented, its intent confirmed as deliberate-but-historical, and it must not change.
The Economics Test
Discovery has a hard constraint: it cannot cost more than the change itself. A discovery phase that takes two weeks for a change that takes two days has failed the economics test. The blast radius scoping is what keeps discovery proportional — you examine only what the change touches, not the entire system.
The decision tree is:
Skip discovery when:
- The change touches fewer than three files, all with existing test coverage
- The change is purely additive (a new endpoint, a new table, a new feature that doesn't touch existing behavior)
- The change is a configuration update with no code logic
In these cases, the tests are the discovery. Read them. They tell you what behavior exists and what must survive.
Run discovery when:
- The change modifies shared infrastructure (authentication, database connections, queuing systems)
- The change modifies an API endpoint or action with unknown consumers
- The change touches any file that has never been formally tested
- The change modifies status fields, state machines, or anything with a compliance obligation
- The first instinct is "I'm not sure exactly what this does"
That last one is the honest heuristic. When you look at the file you're about to change and realize you don't fully understand it, discovery is not optional — it's the work you were going to do anyway, now structured and documented so the agent has it too.
The economics argument against discovery is almost always wrong in the medium term. The half-day you save by skipping discovery on an integration-boundary change is consumed several times over by the debugging sessions when the webhook stops firing — because a change to the payment handler broke a side effect nobody wrote down. Discovery's cost is upfront and visible. The cost of skipping it is delayed and invisible, until it isn't.
A Real Brownfield Walkthrough: Nirbound CRM
I realized we had a brownfield problem the moment the first customer started using Nirbound CRM in production.
The system shipped. It worked. One paying customer, real deals moving through the pipeline, real proposals being generated. And then the next sprint arrived — a list of enhancements, new agent flows, Spanish UI, real-time messaging — and I caught myself about to do the thing I tell everyone else not to do. I was about to brief an agent on "add features X, Y, Z" against a codebase I shipped months earlier and no longer held in my head. That's brownfield. A running system with a customer attached, a spec that drifted during implementation, and a next sprint waiting to either respect what exists or quietly break it.
So before writing a single delta spec, I ran discovery against the codebase. Not the full six layers — the blast radius was the entire product surface, because the enhancement list touched most of it. The output was the 77-feature spec-vs-implementation audit, and the discipline that made it useful was forcing every feature into one of three buckets.
Shipped and working (52 features). Invite-only auth, RLS multi-tenancy, the 11-stage Kanban, the proposal wizard steps 1/2/4/5, the agent job lifecycle, the admin dashboard, the client portal with visual annotations. These are the behaviors that must survive the next sprint. They become the "existing behavior to preserve" section of every delta spec that follows.
Wired-but-unused — infrastructure present, behavior absent (7 features). This is where the landmines live. The sendEmail() function exists in the Resend client — invitation emails, notification emails, client feedback emails — and is never called from anywhere. Someone (me, months ago) built the infrastructure and forgot to wire it. That's the most common brownfield landmine: capability present, behavior absent. Same pattern for Linear escalation (createLinearIssue() exists, zero call sites), Supabase Realtime (use-realtime.ts hook imported by zero components), Datadog metrics (class exists, only prints to stdout with a TODO), and the translation_cache table (sits empty in the schema while every UI string is hardcoded English).
Without the audit, the next sprint would have re-built Resend sending from scratch — duplicating infrastructure that was already there. Or worse, built a parallel notification system alongside the dormant one and left two half-wired paths competing.
Not implemented — genuinely missing (18 features). Admin impersonation, PM Agent auto-trigger on Close Won, pricing document generation, invoice generation, real-time messaging (no messages table), 80% usage-limit warnings, annotation-to-task creation, Spanish i18n, responsive mobile. These are the actual new work.
The contrast is the whole point. Without the inventory, I would have treated "add Spanish UI" as a greenfield feature and missed that the translation_cache table was already waiting. I would have re-specified email notifications and missed that the client was one function call away from working. I would also have treated "real-time messaging" as a quick win until the audit made it clear there's no messages table at all — it's net-new schema, not a wiring job.
That 3-category inventory became the input to the delta spec. Shipped features turned into preservation constraints. Wired-but-unused features turned into wiring tickets, each one tiny and cheap. Missing features turned into actual new scopes with their own specs. The next sprint stopped being a wish list and started being a prioritized stack with honest cost estimates.
That's what discovery buys you on a live system: a map of what's already paid for, what's half-paid-for, and what the next dollar actually builds.
How Stripe Does This at Scale
In February 2026, Stripe published the first detailed account of what it looks like when AI agents work on a production codebase at scale.
Their system is called Minions — a fleet of autonomous coding agents, each given a single task, working in parallel across a codebase of hundreds of millions of lines of code. In the two weeks between their first and second public posts, Minion output went from 1,000 to 1,300 pull requests per week. Every one of those PRs was AI-authored. Every one was human-reviewed before merge. No PR touched production without a human approving it.
The numbers that matter are not the output numbers. The numbers that matter are the infrastructure numbers: over 3 million tests in their test battery. Four hundred internal MCP tools in a system they call "Toolshed." Pre-warmed development environments that spin up in ten seconds. Sourcegraph for code intelligence. Steve Kaliski, a Stripe engineer who worked on the project, told Lenny Rachitsky in March 2026 that he "hasn't started work in a text editor in months" — he starts in Slack, or Google Docs, or a support ticket, and Minions produce the code.
Stripe's architectural insight is something they called Blueprints. Each Blueprint is a workflow template that alternates between two types of nodes: deterministic nodes (run the linter, run the tests, format the output — the same every time) and agentic nodes (reason about the problem, design the solution, write the code — creative, variable, model-driven). The alternation is not incidental. It is the reliability mechanism. The deterministic nodes enforce consistency; the agentic nodes do the creative work. Stripe arrived at the same principle the Dark Factory methodology encodes: the agent does the creative work, the harness does everything else.
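The alternation pattern can be illustrated with a toy pipeline. To be clear, this is a sketch of the principle, not Stripe's implementation — the node structure and names are hypothetical. Deterministic nodes act as gates that either pass or halt the pipeline; agentic nodes transform state; the interleaving guarantees every creative step is immediately followed by an enforced, repeatable check.

```ruby
# Toy Blueprint runner (hypothetical, not Stripe's API).
Node = Struct.new(:kind, :fn)

def run_blueprint(nodes, state)
  nodes.each do |node|
    if node.kind == :agentic
      state = node.fn.call(state)                       # creative, variable work
    else
      node.fn.call(state) or raise "deterministic gate failed"  # fixed check
    end
  end
  state
end

BLUEPRINT = [
  Node.new(:agentic,       ->(s) { s.merge(code: "draft") }),   # write the code
  Node.new(:deterministic, ->(s) { s[:code] == "draft" }),      # lint/test gate
  Node.new(:agentic,       ->(s) { s.merge(code: "final") }),   # revise
  Node.new(:deterministic, ->(s) { !s[:code].empty? }),         # final gate
].freeze
```

The reliability property falls out of the structure: an agentic node can produce anything, but nothing it produces advances past the next deterministic gate unchecked.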
Their most important stated insight: "What's good for human developers is good for AI agents." The existing developer productivity infrastructure at Stripe — the test battery, the code intelligence tools, the CI pipeline, the code review process — transferred directly to agent infrastructure. They didn't build a parallel system for agents. They adapted the existing system.
This is the existence proof that brownfield AI works at scale. But Stripe's case is the hardest possible version of the statement. Most organizations cannot replicate it.
What Stripe's Infrastructure Actually Is
Stripe's three million tests are not just quality assurance. They are a behavioral map of the entire codebase.
When a Minion agent is given a task — "migrate this API endpoint to v2" — it doesn't need a discovery document. The test battery provides one. The tests tell the agent what the endpoint does, what inputs it accepts, what outputs it produces, what side effects it triggers, and what invariants must survive the change. The discovery layer is built into the codebase by years of disciplined test-writing.
The 400+ tools in Toolshed are not just convenience. They are a controlled interface between the agent and the codebase. The agent doesn't have open access to the file system or production systems — it interacts through tools that define what it can and cannot do. Toolshed is a harness at scale. The tools are the blast radius controls.
The pre-warmed devboxes are not just performance optimization. They are isolation infrastructure. Each agent runs in its own sandboxed environment with no access to other agents' work, no production credentials, and no ability to affect production until a human merges the PR. The isolation is what makes the security argument work — an agent that cannot affect production without human approval is an agent a security team can reason about.
Most organizations have none of this. Not the test density, not the tool library, not the isolation infrastructure, not the code intelligence. If you try to apply Minions-style autonomous agent deployment to a codebase with forty percent test coverage, no documentation, and no separation between development and production access, you will produce exactly the failures the security teams are afraid of.
This is not a failure of the approach. It's a failure of prerequisites. And this is precisely where methodology fills the gap that infrastructure doesn't cover.
Dark Factory's six-layer discovery produces, for a specific change, what Stripe has globally for their entire codebase. The discovery document is a manually-constructed behavioral snapshot of the blast radius. The delta spec is the equivalent of the test contract for the changed behavior. The regression scenarios are the equivalent of the test battery for the affected code.
It takes longer. It is less scalable. It is possible without three million tests.
The Security Argument
The security objection to brownfield AI development takes several forms, and each one has a specific answer.
"We can't let an AI touch production code."
No one is suggesting otherwise. Stripe's Minions system merges 1,300 PRs per week — every one reviewed and approved by a human engineer before it touches production. The Dark Factory methodology requires human approval at every phase gate, and the deployment trigger is always a human action. AI-authored does not mean AI-deployed. The security concern about AI affecting production systems without human control is legitimate; the answer is that the methodology addresses it structurally, not by policy.
"We don't know what the AI will change."
This is the discovery problem, and it's real. An agent given broad access to an undocumented codebase and an ambiguous instruction will make unpredictable changes. The answer isn't to prohibit the agent — it's to solve the discovery problem first. The blast radius scoping establishes exactly what the agent is authorized to touch. The spec defines exactly what behavior it must produce. The hard boundaries define what it is never allowed to do. Before the agent writes a line, the security team can read the spec and confirm that the authorized changes are acceptable.
"If something goes wrong, we can't explain what happened."
The harness produces an audit trail. Every agent decision is logged. Every file changed is tracked. The session log records what was attempted, what failed, and why. The deterministic validation rules run against every output. A brownfield change made through the Dark Factory pipeline produces more documentation of what happened than most human developers leave behind. The explainability problem is a process problem, not an AI problem — and the harness solves it.
"Our data is confidential. We can't send it to an AI model."
This is the most legitimate concern, and it requires the most direct answer. The discovery phase does not require sending production data to a model. It requires sending code — the structure, the behavioral contracts, the test coverage, the integration points. In most cases, the code itself is not the confidential asset — it is the interface to the confidential asset. A healthcare company's patient data is confidential; the code that processes it is not. For cases where the code itself contains regulated information (embedded credentials, hardcoded configuration, data in comments), the discovery phase identifies those artifacts — and removing them is part of the preparation work that makes the codebase ready for agent interaction.
For the cases where the code genuinely cannot leave the building — classified systems, regulated financial models, sovereign data requirements — the answer is on-premise deployment. The models can run locally. The methodology is model-agnostic.
"The regulators will ask questions we can't answer."
The regulators are asking questions that most human development processes also can't answer. "Who approved this change? What testing was done? What is the audit trail for this modification?" The Dark Factory pipeline produces systematic answers to all of these questions, because the phase gates require documented artifacts at each transition. A Tier 4 brownfield change produces a spec, a discovery document, a test report, a certification sign-off, and a deployment log. That is more regulatory documentation than most manual development processes provide.
The security argument is not "trust the AI." It is "here is the controlled environment in which the AI operates, here are the limits on what it can do, here is the human review process that governs every change it makes, and here is the audit trail that documents everything that happened." That argument is answerable. The prohibition policy doesn't require answering it — which is why prohibition is easier, in the short term, than governance.
The Brownfield Hypothesis
The first organization that gives its security team a governance model instead of a prohibition policy will have a first-mover advantage in AI-assisted legacy modernization. That company will move faster on technical debt than its competitors. It will have a methodology proven on its own codebase. And it will not have to pay the cost that early movers pay when there is no methodology — because the methodology exists now, even if the case study doesn't yet.
This chapter is the methodology waiting for the case study.
The signals that indicate a legacy organization is ready to move from prohibition to governance are specific. They have nothing to do with AI sophistication and everything to do with process maturity:
Signal 1: They have started documenting what they have. An organization that is writing down what its systems actually do — not what they were designed to do, but what they do — is building the discovery foundation without knowing it. That documentation is the beginning of the blast radius model.
Signal 2: They have separated development from production access. An organization with genuine dev/staging/production separation, where no development activity can affect production without an explicit deployment action, has eliminated the primary safety concern. An agent in a sandboxed development environment with no production credentials is categorically different from an agent with open access.
Signal 3: They have started measuring test coverage. An organization that is tracking test coverage is acknowledging that undocumented behavior is a risk. The test coverage metric is a proxy for how much of the behavioral map already exists. At forty percent coverage, the discovery phase fills the remaining sixty percent. At eighty percent, discovery is faster and the regression scenarios are largely already written.
Signal 4: They have someone internally arguing for AI governance rather than AI prohibition. This person exists in almost every organization — a developer, a technical lead, an innovation director — who sees the gap between what AI can offer and what the security policy allows, and is building the internal argument. That person is the organizational entry point for the methodology.
The brownfield opportunity in legacy industries is not a question of whether. It is a question of when, and who has the methodology ready when the dam breaks.
Discovery as the Security Argument
The security team that prohibits AI isn't prohibiting AI because they hate efficiency. They're prohibiting AI because they can't answer the question their regulator or their executive will ask: "What did the AI change, and how do you know it's correct?"
Discovery answers that question before the agent writes a line. The blast radius document says: these are the files in scope. The behavioral contract inventory says: these are the behaviors that must survive. The tribal knowledge gap list says: these are the behaviors we don't fully understand, and the agent will not touch them. The regression scenarios say: here is how we verified, after the change, that nothing broke.
A security team that can read these artifacts before the agent starts work is a security team that has something concrete to evaluate. Not "we're letting an AI loose on the codebase" — "here is the bounded scope, the defined constraints, and the verification plan for a specific change."
This is the path from prohibition to governance. It does not require trusting AI. It requires trusting process — a process that produces auditable artifacts at every step, requires human approval before production is touched, and makes the AI's work as reviewable as any human developer's pull request.
The first brownfield case study in the Colombian market will not come from a company that suddenly trusts AI. It will come from a company whose security team found, in the discovery document and the delta spec and the regression results, enough structure to say yes. Discovery is the phase that makes that conversation possible.
The next chapter moves from understanding the system to specifying the change — the hardest and most load-bearing skill in the pipeline, and the one everything downstream leans on.
Footnotes
1. Cisco, 2024 Data Privacy Benchmark Study (Cisco Systems, 2024). Survey of 2,600 privacy and security professionals across 12 countries. [VERIFY — confirm exact report title, publication date, and 27%/61% figures]
2. Bank AI prohibition figure (30% of U.S. banks). [SOURCE — confirm originating report; possibly American Bankers Association survey or Bloomberg/Reuters industry reporting, 2023–2024]
3. McKinsey & Company, AI adoption survey. [VERIFY — confirm exact McKinsey report and publication date for the 75% strategic priority / <25% production deployment figures; likely from The State of AI annual report series]
4. Samsung ChatGPT data leakage incidents (April–May 2023): reported by Bloomberg, Reuters, and The Verge (May 2023). Deutsche Bank quote confirmed via Financial Times coverage. Goldman Sachs, JPMorgan, Citibank restrictions reported by Bloomberg (March–May 2023). [VERIFY — confirm precise months and quotes against original reporting]
5. Italy's ChatGPT ban: Garante (Italian Data Protection Authority), March 31, 2023. Australia, Taiwan, South Korea DeepSeek bans: multiple government announcements, February 2025. U.S. agency-level restrictions on DeepSeek: White House memo, January–February 2025. [VERIFY — confirm specific government orders and dates]
6. 2026 U.S. federal ruling on attorney-client privilege and AI tool use. [SOURCE — identify the specific case name, federal circuit, and date; this is a forward-dated reference that may need verification as of publication]
7. UpGuard, unauthorized AI tool usage survey. [SOURCE — confirm report title, publication date, and the 80%+ / 90% of security professionals figures]
8. Stripe, "How we built Minions" engineering blog post (early 2026). The 1,300 pull requests per week figure and Steve Kaliski's quote sourced from Lenny Rachitsky, "How Stripe built an internal AI system that merges 1,300 PRs a week," Lenny's Newsletter (March 2026). [VERIFY — confirm exact publication dates and Stripe's official post title]