Every system in the Dark Factory methodology has exactly one trust tier. The tier is chosen during intake, before any spec is written, and it governs the rest of the pipeline — how many behavioral scenarios, how much stress testing, how much human sign-off, how often the evaluation suite re-runs.
This appendix is the reference for getting that choice right.
I'll be honest about the limitation up front: tier classification is a judgment call. There's no formula that takes a description of a system and returns a number. What this appendix gives you is the question to ask, the table of consequences each answer implies, worked examples from projects I've shipped, and the common edge cases that trip people up. Apply it with domain context. Don't mechanize it.
The One Question
What is the worst realistic outcome if this system gets it wrong?
"Realistic" is the load-bearing word. Every system could, in some contrived scenario, cause catastrophic harm. A spellchecker could, theoretically, autocorrect a medication name into a dangerous one. A chart library could, theoretically, render a graph that a clinician misreads. If you tier by worst-case imagination, everything becomes Tier 4 and the tier stops discriminating. Tier by the outcomes you actually expect to see in normal operation of the system as designed.
| Answer | Tier | Typical domains |
|---|---|---|
| Annoyance, retry, minor inconvenience | Tier 1 — Deterministic | Internal tools, content drafting, dev utilities |
| Wasted time, wasted resources, wasted money | Tier 2 — Constrained | Marketing automation, data processing, reporting |
| Financial or reputational damage, legal exposure | Tier 3 — Open | Customer-facing agents, financial tools, hiring systems |
| Legal liability, safety risk, irreversible harm | Tier 4 — High-Stakes | Healthcare triage, safety-critical ops, regulated industries, financial trading |
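The answer-to-tier mapping can be sketched as a lookup. This is a minimal, illustrative Python sketch — the category keys and names are mine, not part of the methodology, and the judgment of which category applies stays with a human:

```python
from enum import IntEnum

class Tier(IntEnum):
    DETERMINISTIC = 1
    CONSTRAINED = 2
    OPEN = 3
    HIGH_STAKES = 4

# Keys are judgment categories a human assigns after answering the
# one question; nothing here automates the judgment itself.
WORST_OUTCOME_TIER = {
    "annoyance_or_retry": Tier.DETERMINISTIC,
    "wasted_time_or_money": Tier.CONSTRAINED,
    "financial_or_reputational_damage": Tier.OPEN,
    "legal_safety_or_irreversible_harm": Tier.HIGH_STAKES,
}

def classify(worst_realistic_outcome: str) -> Tier:
    """Map a judged worst-realistic-outcome category to its tier."""
    return WORST_OUTCOME_TIER[worst_realistic_outcome]
```

Note that the input is the *realistic* worst outcome, not the imaginable one — the lookup only encodes the table; the hard part happens before the function is called.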
What Each Tier Requires
| Element | Tier 1 | Tier 2 | Tier 3 | Tier 4 |
|---|---|---|---|---|
| Behavioral scenarios | 7 minimum | 7 + 2 variations each | 7 + 3 variations each | 7 + 5 variations each |
| Intent contract | Optional | Recommended | Required | Required + domain expert review |
| Stress testing | None | Structural edges (Cat D) | Social + framing + structural (A, B, D) | All categories + reasoning alignment (A–E) |
| Deterministic validation | Optional | Key outputs | All outputs | All outputs + dual-check |
| Progressive autonomy | Full auto | Auto + logging | Human oversight | Human mandatory |
| Continuous evaluation | Not needed | 10% sampling | 25% sampling | Full coverage + audit |
| Stress-test cadence | Not needed | Before deploy | Before deploy + quarterly | Before deploy + on any change |
| Human sign-off | Deploy approval | Spec + deploy | Spec + intent + test + deploy | All gates + domain expert |
Read down the column for your tier, not across the rows. The point is not that Tier 4 is "harder" than Tier 1 — it's that every pipeline element has a different answer at each tier, and skipping a row because another row says "optional" is how you end up with a Tier 3 system deployed with Tier 1 oversight.
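Reading every row for a tier is easier to enforce when the table is data. A minimal sketch of a few of the rows above — the `TierRequirements` structure and its field names are illustrative, not prescribed by the methodology:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TierRequirements:
    base_scenarios: int           # behavioral scenarios, minimum
    variations_per_scenario: int  # stress variations per scenario
    intent_contract: str          # "optional" | "recommended" | "required"
    sampling_rate: float          # continuous-evaluation sampling (1.0 = full)
    human_in_loop: bool           # human oversight in production

REQUIREMENTS = {
    1: TierRequirements(7, 0, "optional",    0.00, False),
    2: TierRequirements(7, 2, "recommended", 0.10, False),
    3: TierRequirements(7, 3, "required",    0.25, True),
    4: TierRequirements(7, 5, "required",    1.00, True),
}

def total_scenarios(tier: int) -> int:
    """Seven base scenarios plus the per-scenario variations."""
    r = REQUIREMENTS[tier]
    return r.base_scenarios * (1 + r.variations_per_scenario)
```

Encoding the table this way makes "skipping a row" a type error rather than an oversight: every tier carries a value for every element.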
Examples Per Tier
Tier 1 — Deterministic. A script that generates release notes from commit messages. An internal tool that summarizes last week's team standup. A dev utility that renames files by pattern. If the tool produces garbage, you notice immediately, retry, and nothing is lost. Full automation. Seven behavioral scenarios. No intent contract. No stress testing. Deploy-time approval only. This tier exists to keep the methodology from feeling like overhead on things that don't need it.
Tier 2 — Constrained. VZYN Labs, the marketing agent, runs here. The worst realistic outcome is a bad SEO recommendation, a misdirected piece of content, or a monthly report that misreads client data. Time and money wasted. Reputation in a client relationship bruised, not broken. Tier 2 adds two stress-test variations per scenario (typically Category D structural edges), recommends an intent contract, and requires spec approval plus deploy approval. 10% of production output is sampled into a flywheel for continuous review.
Tier 3 — Open. Customer-facing agents, financial tools, hiring systems. SonIA CRM crossed into Tier 3 at the point where it started handling client-facing workflows — a mis-routed deal or a badly worded pricing quote has real downstream consequences. Tier 3 requires an intent contract, runs social and framing stressors alongside structural edges, and gates deployment behind spec + intent + test + deploy sign-offs. 25% production sampling.
Tier 4 — High-Stakes. BuildingMgmtOS under Ley 675 de 2001 compliance, the Regasificadora del Pacífico safety-procedure agent, a pharmacy-referral call center. Legal, safety, or irreversible harm is on the table. Tier 4 runs all five stress-test categories including reasoning-output alignment, requires domain expert review of the intent contract, and mandates human-in-the-loop for every production decision. Full evaluation coverage. Stress test re-runs on any model swap, prompt change, or architectural change.
Edge Cases
Multi-tier systems. Most real products are not uniformly tiered. BuildingMgmtOS has a Tier 4 financial module (Ley 675 compliance), a Tier 2 maintenance-request module, and a Tier 1 notification subsystem. The correct approach is not to pick the highest and apply it everywhere — that produces specification fatigue and ships nothing. Tier each module separately. The product has tiers; the system doesn't.
The catch is cross-module flow. If the Tier 2 maintenance module feeds into the Tier 4 financial module (for example, if repair costs affect the reserve fund calculation that governance rules constrain), the connecting interface has to be treated as Tier 4. The consequence crosses the boundary. Tier up at the seam.
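Tier-up at the seam reduces to a max. A one-function sketch (the function name is mine):

```python
def interface_tier(upstream_tier: int, downstream_tier: int) -> int:
    """A connecting interface is treated at the higher tier of the two
    modules it joins, because the consequence crosses the boundary."""
    return max(upstream_tier, downstream_tier)

# The Tier 2 maintenance module feeding the Tier 4 financial module:
# the interface between them is handled as Tier 4.
```

The direction of the data flow doesn't matter; what matters is that a wrong output on one side can become a consequence on the other.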
Partial scope. A feature inside a Tier 3 system is usually Tier 3 — inherited from the consequence model of the containing system. But a feature that operates entirely in read-only preview mode, or behind an explicit "draft" flag that can't be published without human approval, can legitimately be tiered lower. The test: if this feature produces a wrong output, does any user or system act on it before a human sees it? If no, you can tier down. If yes, stay at the containing tier.
New domain, unclear consequence. When a system operates in a domain you don't understand yet, default up, not down. I did the reverse on the original VZYN Labs build — treated a multi-agent marketing platform as if marketing mistakes were low-consequence, and shipped an architecture that couldn't be simplified without a rebuild. Over-tiering costs time. Under-tiering costs consequences. The cost is asymmetric. Tier up.
Tools operating on other tools. A code-modification agent operating on a Tier 4 codebase inherits Tier 4 scrutiny even if the agent itself feels like developer tooling. The consequence model of the modified system dominates. The spec-drift detector that watches Edifica's spec is a Tier 2 tool; the code-change agent that modifies the Edifica production codebase is Tier 4.
How Tiers Migrate
Trust tier is not a permanent tattoo. Systems move through tiers as they mature and as their role changes. The rule is that migration is always a deliberate act — you don't drift tiers, you move them.
Tier-down migration is rare and requires evidence. A system originally classified Tier 3 can move to Tier 2 after six months of production data showing that the actual consequence of errors was smaller than predicted. The evidence is not "no complaints received" (absence of evidence is not evidence of absence). The evidence is a documented review of production errors and their measured downstream impact. Tier-down requires human sign-off by whoever owned the original classification — ideally with a second reviewer who wasn't involved in the original tier decision.
Tier-up migration is more common and requires less evidence. If a system starts producing outputs that are being consumed in higher-stakes ways than originally scoped — a Tier 2 reporting tool whose outputs start informing legal or financial decisions — the tier increases. Tier-up is immediate. The pipeline requirements for the new tier apply to the next change, not retroactively, but no further deployments happen under the old tier.
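The asymmetry between the two migration directions can be made explicit. A sketch under the rules above — names and the boolean flags are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MigrationRequest:
    current_tier: int
    proposed_tier: int

def migration_allowed(req: MigrationRequest,
                      documented_error_review: bool = False,
                      independent_signoff: bool = False) -> bool:
    """Tier-up is always allowed and takes effect immediately.
    Tier-down requires a documented review of production errors and
    their measured impact, plus sign-off independent of the original
    classification. 'No complaints' is not evidence."""
    if req.proposed_tier >= req.current_tier:
        return True
    return documented_error_review and independent_signoff
```

The defaults encode the asymmetry: moving up needs no paperwork to be permitted, moving down is blocked until both forms of evidence exist.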
Tier-up at integration. The most frequent cause of migration is integration. A Tier 2 system exposed as an API that other Tier 4 systems consume has to be re-evaluated. You can't claim Tier 2 for a component whose output is load-bearing inside a Tier 4 decision. This is how systems silently drift into higher consequence without the tier reflecting it.
Tier-stable over model upgrades. Tier doesn't change when the model changes. The consequence of a wrong output is a property of the system, not the model. A model upgrade might change baseline accuracy or anchoring susceptibility — that's why stress tests re-run — but the tier is unchanged. If GPT-N+1 makes a Tier 4 system more accurate, it's still Tier 4.
Key Principles
- Classify the consequence, not the complexity. A one-line function that routes a user to the right clinician is Tier 4. A twenty-thousand-line marketing pipeline is Tier 2. Lines of code tell you about engineering effort, not risk.
- Set the tier once, at intake. It governs every downstream decision in the methodology — scenarios, stress tests, sign-offs. If you try to decide the tier as you go, you'll always pick the one that's cheapest at that moment.
- When in doubt, tier up. Tier 4 methodology applied to a Tier 3 system is over-engineered but safe. Tier 2 methodology applied to a Tier 4 system is an incident waiting to happen. The cost of the mistake is asymmetric.
- Tier 4 means human mandatory, forever. That's the design, not a limitation. Tier 4 systems earn more autonomy over time on specific sub-tasks through progressive autonomy (shadow → supervised → autonomous for narrow scopes), but the human-in-the-loop requirement for the system as a whole is a structural property of Tier 4, not a temporary constraint to be optimized away.
- Document the tier decision. The tier lives in the project card at intake, in the spec header, and in the decision log. Three months later, when someone asks why this system has 30 scenarios and stress tests on every prompt change, the tier choice should be traceable back to the originating question and the answer.
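The documented decision can be as small as a record with the originating question's answer attached. A hypothetical shape for the project-card entry — every field name and the sample values are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TierDecision:
    """Recorded at intake; copied into the spec header and the
    decision log so the choice stays traceable months later."""
    tier: int
    worst_realistic_outcome: str  # the answer to the one question
    decided_by: str

decision = TierDecision(
    tier=3,
    worst_realistic_outcome="mis-routed deal; financial and reputational damage",
    decided_by="intake reviewer",
)
```

What matters is that the *answer* travels with the tier number — a bare "Tier 3" in a spec header answers "what", not "why".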
One More Caveat
Trust tiers are a frame, not a standard. The four-tier model comes from the practical need to scale specification and evaluation rigor against real consequence. Emerging certification regimes — ISO 42001, the EU AI Act's high-risk categories, sectoral regulations in healthcare and finance — have their own tiering schemes with their own legal force. If your system falls under a regulatory regime, that regime's classification governs; the trust tier is how you implement against it, not a replacement for it. Where a regulator says "high-risk," read it as Tier 4 and add whatever the regulator specifically requires on top.
The tier is a tool for engineering discipline. The regulation is the floor. Your job is to make them compatible.