Most agent failures happen after deployment, not before.
The system passes testing. It meets the tier thresholds. The certification checklist is complete. It deploys. And then, slowly, it degrades. The model provider updates their system. User inputs drift from the distribution the system was tested against. Edge cases accumulate that no scenario anticipated. A metric that was 95% at launch quietly drops to 87% over three months — and nobody notices until a user complains.
Certification is the gate. Deployment is the transition. Maintenance is everything after.
Certify: The Human Gate
Certification is where a human — not a metric, not an LLM judge, a human — reviews all artifacts and confirms the system meets the bar for its trust tier. It's the gate between "built and tested" and "allowed to ship."
The certification review is not a rubber stamp. It's a deliberate, structured examination of three questions that automated testing cannot answer:
Does the code match the behavioral contract? The test suite verifies that specific scenarios pass. Certification verifies that the scenarios represent the right behaviors — that what was tested is what matters, and that what wasn't tested isn't lurking as a gap. This is a human judgment call. A test suite that covers 97% of the scenarios in the test library might cover 60% of the behaviors that users will actually encounter. The certifier asks: is this test library representative of reality?
Are there suspicious passes? Some scenarios pass too easily. They might be passing because the evaluation criteria are too loose — the LLM-as-judge is grading on a curve, accepting outputs that technically meet the letter of the scenario but violate its spirit. Others pass because they're testing the wrong thing: the scenario asks "does the output contain the required fields?" when the actual question is "does the output make clinical sense?" Suspicious passes reveal evaluation gaps.
Could this system succeed at the wrong thing? This is the Klarna checklist — the single most important question in certification. Not "does it pass the tests" — "does it pass the tests on the right thing." The system does exactly what the spec says. Is what the spec says what the organization actually needs?
A contract review tool that summarizes contracts correctly but never flags missing indemnity clauses is succeeding at the wrong thing. A customer service agent that resolves tickets quickly at the cost of customer relationships is succeeding at the wrong thing. A building governance system that generates compliant reports but fails to alert administrators about upcoming assembly deadlines is succeeding at the wrong thing.
The Klarna checklist can't be automated because it requires understanding the gap between specification and organizational intent. The certifier reads the test results, looks at a sample of actual outputs, and asks: if this system operated at scale, would the aggregate effect match what the organization actually needs?
What Certification Looks Like by Tier
Every tier requires the baseline certification review. What changes is the scope and depth.
Tier 1 — Assisted: Code review against spec. Test results review. Deployment artifact review (what will be deployed, where). One human. Thirty minutes for a clean build. The certifier is usually the same person who built the system.
Tier 2 — Constrained: Everything in Tier 1, plus stress test review. Are the variation stability numbers from the factorial stress testing within threshold? Are failures concentrated in one stress category — suggesting a systematic gap rather than random variation? Has the LLM-as-judge been calibrated, or is it grading at the same level as the model it's evaluating? Thirty minutes to two hours depending on the complexity and the test results.
Tier 3 — Supervised: Everything in Tier 2, plus intent contract alignment. The certifier runs the Klarna checklist explicitly: what is the system optimizing for, and is that what the organization needs? Scenario library representativeness is reviewed by someone who knows the domain, not just the system. The evaluators and the builders are different people — a fresh set of eyes on whether the test cases reflect how real users actually behave. Two to four hours. Involves at least two people.
Tier 4 — Controlled: Everything in Tier 3, plus domain expert review. A pharmacist reviews the medication guidance scenarios. An LNG operations engineer reviews the safety compliance scenarios. A Colombian lawyer reviews the governance compliance scenarios. The domain expert isn't reviewing the code — they're reviewing the system's behavior in context. They ask: does this do the right thing, not just the technically specified thing? Half a day to a full day. Multiple reviewers across different dimensions of the domain.
When certification fails: The system goes back. Not forward with caveats, not shipped with a "we'll fix it later" commitment. Back. The failure goes into the evaluation library as a new test case. The gap that the certification found — spec ambiguity, evaluation weakness, wrong success metric — is fixed before the next certification attempt. Certification failures are the most valuable input the methodology produces: they're the discovered gaps that no automated test found.
The Progressive Autonomy Sequence
Deployment is not binary. A system doesn't go from "not deployed" to "fully autonomous" in one step. The Dark Factory approach uses a progressive autonomy sequence that reduces risk at every transition.
Shadow mode: The system runs alongside the existing process. It processes the same inputs and produces outputs — but those outputs go to a log, not to users. Humans continue to do the work. The system's outputs are compared to the humans' outputs, revealing gaps before any user encounters them.
Shadow mode is the most valuable phase for high-tier systems. It lets you observe what the system would have done in real conditions, with real inputs, against real reference outcomes. Gaps that weren't visible in testing become visible in shadow mode because real users phrase things in ways the test scenarios didn't anticipate.
The shadow mode threshold for moving to supervised: the system's outputs match or exceed human reference outputs on 90% of cases at Tier 2, 95% at Tier 3, and 99% at Tier 4. Below the threshold, the shadow data goes back to the spec and evaluation teams as new training material.
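The shadow-to-supervised gate is simple enough to sketch directly. Assuming one boolean per shadow case (did the system match or exceed the human reference?), a minimal Python version might look like this; the function and variable names are illustrative:

```python
# Shadow mode gate: tier thresholds follow the text (90% / 95% / 99%).
SHADOW_THRESHOLDS = {2: 0.90, 3: 0.95, 4: 0.99}

def shadow_gate(matches: list[bool], tier: int) -> bool:
    """Return True if the system may advance from shadow to supervised.

    `matches` holds one boolean per shadow case: did the system's output
    match or exceed the human reference output?
    """
    if not matches:
        return False  # no shadow evidence, no advancement
    match_rate = sum(matches) / len(matches)
    return match_rate >= SHADOW_THRESHOLDS[tier]

# Example: 96 of 100 shadow cases matched the human reference.
results = [True] * 96 + [False] * 4
assert shadow_gate(results, tier=2) is True   # 0.96 >= 0.90
assert shadow_gate(results, tier=4) is False  # 0.96 <  0.99
```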
Supervised mode: The system produces outputs that a human reviews before they reach users. Every output is inspected. Errors are caught before they propagate. The human is doing work, but they're reviewing AI output rather than producing from scratch — which is faster.
Supervised mode runs until the error rate in human review drops to within threshold. At Tier 2, that threshold is 5% of outputs requiring correction. At Tier 3, it's 2%. At Tier 4, the system may run in supervised mode indefinitely — for systems where human review is mandatory by design, "supervised mode" is the steady state, not a transition.
Auto with logging: The system produces outputs autonomously, but every output is logged and a percentage is sampled for quality review. The flywheel is running. Human review is triggered by the evaluation layer, not by default.
Full autonomy: The system operates independently. The flywheel monitors for drift. Human review is triggered by the evaluation layer when quality metrics cross thresholds or when specific output patterns trigger escalation rules.
Escalate back: At any point in the sequence, if quality metrics degrade significantly or if a class of failures appears that the harness can't contain, the system steps back one level in the autonomy sequence. A fully autonomous system that encounters a new failure mode reverts to auto-with-logging until the failure is understood and contained.
The progression is not a one-way door. Systems that experience incidents regress. Systems that demonstrate consistent reliability at one level advance. The autonomy level is set by evidence, not by schedule.
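The evidence-driven progression can be sketched as a tiny state machine. The level names and the one-step-at-a-time rule follow the text; everything else is illustrative:

```python
# Progressive autonomy as an ordered sequence; a system moves one
# level at a time, driven by evidence rather than schedule.
LEVELS = ["shadow", "supervised", "auto_with_logging", "full_autonomy"]

def next_level(current: str, metrics_ok: bool, incident: bool) -> str:
    i = LEVELS.index(current)
    if incident:                      # step back one level on incident
        return LEVELS[max(i - 1, 0)]
    if metrics_ok:                    # advance one level on sustained evidence
        return LEVELS[min(i + 1, len(LEVELS) - 1)]
    return current                    # otherwise hold

assert next_level("auto_with_logging", metrics_ok=True, incident=False) == "full_autonomy"
assert next_level("full_autonomy", metrics_ok=True, incident=True) == "auto_with_logging"
```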
Deploy: Production Setup
Deployment should be boring. If shipping is exciting, something is wrong.
The deployment checklist ensures that correctly-built software isn't deployed to an incorrectly-configured environment. This is a real and underappreciated failure mode: a system that was correct in the test environment fails in production because of configuration differences that have nothing to do with the code.
The core deployment checklist items apply to every tier:
## Deployment Checklist
Infrastructure:
- [ ] Production environment provisioned and verified
- [ ] Environment variables set (no hardcoded development values)
- [ ] Database connection strings verified against production instance
- [ ] Authentication configured for production tenant (not test tenant)
Data:
- [ ] Test data removed from production database
- [ ] Data isolation verified (tenant A cannot see tenant B's data)
- [ ] Seed data for required reference tables present
Connectivity:
- [ ] All external API credentials valid in production
- [ ] Webhook endpoints configured and responding
- [ ] Rate limits verified and appropriate for production volume
Monitoring:
- [ ] Error logging configured and routing to alerting system
- [ ] Core flows verified in production environment
- [ ] Rollback procedure documented and tested
Harness:
- [ ] Evaluation flywheel connected to production outputs
- [ ] Sampling rate configured for tier
- [ ] Escalation routing configured (where do flags go?)
The reason this checklist exists as a separate phase — rather than being folded into BUILD or CERTIFY — is that deployment configuration errors are different in kind from code errors. They're not caught by tests. They don't appear in code review. They reveal themselves when the system runs in the production environment, against production data, with production credentials. A deployment checklist that runs after certification and before go-live is the only systematic way to catch them.
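One way to make the infrastructure items systematic is a pre-flight script that runs after certification and before go-live. This is a minimal sketch, not a prescribed tool; the variable names and forbidden fragments are assumptions:

```python
# Pre-flight configuration check: a sketch of automating the
# Infrastructure items in the checklist. Names are assumptions.
REQUIRED_VARS = ["DATABASE_URL", "API_KEY", "TENANT_ID"]
FORBIDDEN_FRAGMENTS = ["localhost", "test", "staging"]  # dev leftovers

def preflight(env: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    for var in REQUIRED_VARS:
        value = env.get(var, "")
        if not value:
            problems.append(f"{var} is not set")
        elif any(frag in value.lower() for frag in FORBIDDEN_FRAGMENTS):
            problems.append(f"{var} looks like a non-production value: {value!r}")
    return problems

# A development database URL sneaks through code review, but not this check.
env = {"DATABASE_URL": "postgres://localhost/dev", "API_KEY": "prod-k1", "TENANT_ID": "acme"}
assert preflight(env) == ["DATABASE_URL looks like a non-production value: 'postgres://localhost/dev'"]
```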
Maintain: The Continuous Flywheel
MAINTAIN never exits. It runs as long as the system is live. I realized early that this is the phase most teams skip — and it's the phase where the worst failures hide.
The MAINTAIN flywheel has four components that operate continuously:
1. Deterministic Validation
Every agent output flows through deterministic rules before it reaches users. The rules check for specific, verifiable conditions — not quality in the abstract, but concrete constraints that can be checked programmatically.
For a medication guidance system: does the output contain a source citation? Is the citation to an approved source in the knowledge base? Does the output flag the interaction as requiring pharmacist verification (for Tier 4 categories)? These are deterministic checks. They either pass or fail. Failed outputs route to human review before reaching the customer service representative.
The deterministic validation layer is the harness operating in production. It's not evaluation — it's enforcement. It catches the outputs that violate hard boundaries before they propagate.
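A deterministic validator for the medication guidance example might look like the following sketch. The field names and the approved-source list are assumptions; the point is that every rule is a pure pass/fail predicate:

```python
# Deterministic validation sketch for the medication guidance example.
# Field names and the approved-source set are illustrative.
APPROVED_SOURCES = {"formulary_v12", "interaction_db_2024"}

def validate(output: dict) -> list[str]:
    """Run hard, programmatic checks; return the names of failed rules."""
    failures = []
    if not output.get("citation"):
        failures.append("missing_citation")
    elif output["citation"] not in APPROVED_SOURCES:
        failures.append("unapproved_source")
    if output.get("tier4_category") and not output.get("pharmacist_flag"):
        failures.append("missing_pharmacist_flag")
    return failures

out = {"citation": "blog_post", "tier4_category": True, "pharmacist_flag": False}
assert validate(out) == ["unapproved_source", "missing_pharmacist_flag"]
# A non-empty failure list routes the output to human review instead
# of the customer service representative.
```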
2. LLM-as-Judge Sampling
A percentage of outputs that pass deterministic validation are evaluated by an LLM judge for quality. The judge model must be different from the agent being evaluated — you don't ask the student to grade their own test. Ideally it's a stronger model, with access to the evaluation criteria and the source material the agent used.
The judge evaluates: Is the output factually correct given the source material? Is it appropriately calibrated (not overconfident about uncertain things)? Does it match the tone and format specified in the spec? Does it represent what the organization would want said in this context?
The sampling rate scales with the tier. Tier 2: 10% of outputs evaluated by the judge. Tier 3: 25%. Tier 4: 100% — every output is evaluated, every output has a human in the loop.
The critical practice: audit the passed runs. Sample 5-10% of outputs that the judge approved and have a human check them. This is where the evaluation library grows — from discovering what the automated evaluation missed. The judge's false negative rate — the quality issues it failed to flag — is the most important metric in the flywheel, because it represents the class of failures that are reaching users uncaught.
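The tier-scaled sampling and the passed-run audit can be sketched as follows. The rates follow the text (10% / 25% / 100%, with a 5-10% audit of approved outputs); the judge call itself is out of scope here, since it is a model API call:

```python
import random

# LLM-as-judge sampling sketch. Sampling rates follow the text;
# AUDIT_RATE sits inside the 5-10% band described above.
SAMPLE_RATES = {2: 0.10, 3: 0.25, 4: 1.00}
AUDIT_RATE = 0.07

def should_judge(tier: int, rng: random.Random) -> bool:
    """Decide whether this output goes to the LLM judge."""
    return rng.random() < SAMPLE_RATES[tier]

def should_audit(rng: random.Random) -> bool:
    """Human spot-check of an output the judge already approved."""
    return rng.random() < AUDIT_RATE

rng = random.Random(0)
judged = sum(should_judge(2, rng) for _ in range(10_000))
assert 800 < judged < 1200     # roughly 10% of outputs at Tier 2
assert should_judge(4, rng)    # Tier 4: every output is judged
```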
3. Human Review
Flagged outputs — by deterministic validation or by the LLM judge — route to human review. The human reviewer adjudicates: true positive (the flag was correct and the output needs fixing) or false positive (the flag was incorrect and the output was fine).
True positives feed back into the agent — the problem is diagnosed, the spec or harness is updated, and similar outputs are prevented going forward. False positives feed back into the evaluation rules — the deterministic validator or the LLM judge is updated to stop flagging this pattern.
Both types of feedback improve the system over time. The flywheel doesn't just catch failures — it generates improvements. The evaluation library grows from human review. The harness gets tighter from true positives. The evaluation precision improves from false positives. Each reviewed output makes the next cycle more accurate.
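The adjudication routing and the precision bookkeeping it feeds are small enough to sketch directly; the queue names here are illustrative:

```python
# Human review adjudication sketch. True positives route to the agent
# side (spec / harness); false positives route to the evaluation side.
def route_adjudication(flag_was_correct: bool) -> str:
    if flag_was_correct:
        return "spec_and_harness_queue"   # fix the system that produced it
    return "evaluation_rules_queue"       # tighten the evaluator instead

def review_stats(adjudications: list[bool]) -> dict:
    """Summarize a review batch: True = true positive (flag was right)."""
    tp = sum(adjudications)
    fp = len(adjudications) - tp
    precision = tp / len(adjudications) if adjudications else 0.0
    return {"true_positives": tp, "false_positives": fp, "flag_precision": precision}

batch = [True, True, False, True]   # 3 real problems, 1 spurious flag
assert review_stats(batch)["flag_precision"] == 0.75
assert route_adjudication(False) == "evaluation_rules_queue"
```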
4. Drift Detection and Intent Contracts
Intent contracts include alignment drift indicators — leading metrics that signal when the system is drifting from its intended behavior before the lagging metrics catch it.
Response time increasing is a lagging indicator. If the system is getting slower, something is wrong — but it's already wrong before you see the trend. Confidence calibration shifting is a leading indicator. If the system's confidence scores are changing distribution without a corresponding change in accuracy, the model's behavior has shifted.
Useful drift indicators vary by system. For a customer service agent: average resolution time (should be stable), escalation rate (should be below threshold), customer satisfaction sample (should be above threshold), first-response error rate (should be near zero). For a document extraction system: field completion rate, confidence score distribution, source citation accuracy rate.
The key is monitoring the leading indicators — the ones that move before the lagging indicators move. A system whose confidence distribution is shifting is going to start producing worse outputs before its error rate climbs. Catching the distribution shift lets you investigate and intervene before users experience the degradation.
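A minimal leading-indicator check might compare a recent confidence window to a frozen baseline window. The mean-shift test below is deliberately crude; a production flywheel might use a KS test or a population-stability index instead, and the threshold value is an assumption:

```python
import statistics

# Leading-indicator drift sketch: flag when the recent confidence-score
# distribution moves away from a frozen baseline window.
def confidence_drift(baseline: list[float], recent: list[float],
                     max_shift: float = 0.05) -> bool:
    """Flag drift when mean confidence moves more than max_shift."""
    return abs(statistics.mean(recent) - statistics.mean(baseline)) > max_shift

baseline = [0.91, 0.93, 0.90, 0.92, 0.94, 0.91]
recent   = [0.97, 0.98, 0.99, 0.97, 0.98, 0.99]  # model has become overconfident
assert confidence_drift(baseline, recent) is True
```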
The Certification Conversation
Certification has a social dimension that isn't captured in checklists.
The certifier isn't just reviewing artifacts. They're confirming that the people who built the system believe it's ready — and surfacing the concerns they haven't said out loud. An experienced certifier knows that the most important information in a certification review is not in the test results. It's in the pauses when they ask "is there anything about this system that you're uncertain about?"
Teams that have spent weeks building a system have a complex relationship with its readiness. They know its limitations better than anyone, but they also want it to ship. The certification conversation creates a structured space for honest disclosure: "There's this edge case we never resolved." "The scenario library doesn't cover this input pattern well." "We're not totally sure the model handles this correctly in all cases." These are the things that would have been caught in shadow mode — and the certifier's job is to ensure they're caught before shadow mode rather than during it.
For Tier 4 systems, the certification conversation includes the domain expert explicitly. The domain expert is not asked "does the code look right?" They're asked "would you be comfortable with this system advising on the cases we showed you?" That question produces different answers than "did it pass the test scenarios?"
The certification conversation is the last gate where human judgment — not metrics, not checklists, not automated evaluation — determines whether the system is ready. Preserving that judgment means keeping the conversation open, not just the checklist. The checklist answers the verifiable questions. The conversation answers the questions that can't be verified automatically.
What Maintenance Anti-Patterns Look Like
Teams that skip maintenance aren't careless. They skip it because maintenance feels low-value compared to building. No new feature at the end. No demo to show. No sprint velocity to report. Not glamorous work — foundation work. It prevents things from getting worse rather than making things visibly better. Organizations that optimize for visible output systematically underinvest in the layer that keeps the stack standing.
The result is a predictable pattern: systems that work well for their first six months, then degrade, then become unreliable, then become deprecated because nobody trusts them anymore. The failure mode looks like model obsolescence or product irrelevance, but the root cause is usually maintenance debt.
Anti-pattern 1: The "Set It and Forget It" Deployment. The system deploys. The team moves to the next project. Nobody is assigned to monitor the flywheel. The sampling runs, but nobody reviews the flagged outputs. The evaluation library stops growing because no new cases are added. Model updates happen silently. Six months later, a user reports a serious error — and the investigation reveals dozens of similar errors in the logs that nobody saw because nobody was looking.
The fix is assignment: someone owns the flywheel. Not as a second job — as a primary responsibility. The maintenance owner reviews the weekly flywheel reports, triages flagged outputs, updates evaluation rules, and runs the model change protocol when updates arrive. This is not glamorous work. It is the work that keeps the system reliable.
Anti-pattern 2: Ignoring the Flywheel Findings. The flywheel is running, the reports are being read, but the findings don't produce changes. The LLM-as-judge flags 12% of outputs as needing improvement. The maintenance owner reviews the flags, confirms they're real, and notes them — but doesn't update the spec, the harness, or the evaluation library. The flags continue. The 12% quality gap persists indefinitely.
Flywheel findings are inputs to the pipeline, not just observations. Every true positive in human review should produce a change: a spec update, a harness rule, a new evaluation scenario. If the flywheel is running and findings aren't producing changes, the flywheel is providing information without feedback — observation without correction.
Anti-pattern 3: Forgetting the Evaluation Library. The system launched with thirty behavioral scenarios. A year later, it's still running on thirty behavioral scenarios. The system handles ten thousand different input patterns. Thirty scenarios cover the original distribution. The gap between test coverage and production distribution has grown, and the maintenance team doesn't know where the uncovered territory is.
The evaluation library grows from production observations. Every flagged output that reveals a new failure pattern becomes a new scenario. Every human review that discovers a gap in the evaluation criteria produces a new variation. An evaluation library that doesn't grow is a library that's becoming less representative over time.
Anti-pattern 4: Manual Monitoring. The team doesn't have a flywheel — they manually sample outputs periodically. Once a week, someone looks at twenty outputs and checks if they "seem right." This is better than nothing. It catches catastrophic failures. It misses the subtle, systematic drifts that compound into significant problems.
The flywheel is not optional for Tier 3 and Tier 4 systems. Manual sampling at 10-20 outputs per week cannot detect a 3% quality decline in a system that processes a thousand outputs per day. The math doesn't work. Automated evaluation at scale is what catches the drifts that matter.
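The arithmetic behind that claim is quick to check, under the simplifying assumption that sampled outputs fail independently:

```python
# Why weekly manual sampling misses a 3% decline: with 20 samples and
# a 3% failure rate, most weeks show zero failures at all.
p_fail, n = 0.03, 20
p_zero_failures = (1 - p_fail) ** n
assert round(p_zero_failures, 2) == 0.54  # ~54% of weeks look perfectly clean

# Automated evaluation over a week of production traffic (7,000 outputs
# at 1,000/day) expects ~210 failures; the decline is unmissable.
expected_failures = int(7_000 * p_fail)
assert expected_failures == 210
```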
The Model Change Protocol
When the model provider updates their system — and they will, without warning — the maintenance protocol activates.
This is not hypothetical. A model that scored 97.6% accuracy in one evaluation dropped to 2.4% after a provider update. The provider didn't break the model — they optimized it for different criteria, and the system that depended on the old behavior collapsed. The model update that produces this kind of collapse is silent. It doesn't throw errors. The system continues to run, producing outputs with the same structure, at the same speed, that have become substantially wrong.
The model change protocol:
- Detect: Monitor model version identifiers in API responses. Any version change triggers the protocol.
- Re-run the full evaluation suite: Every behavioral scenario, every stress variation, every deterministic validation check — run against the new model version.
- Compare to baseline: Every metric compared to the last certified baseline. Any metric that degrades more than 5% is a flag.
- Decision:
- No metrics degraded: update the baseline, continue deployment.
- One or more metrics degraded 5-15%: investigate. The degradation may be acceptable given model improvements in other areas, or it may be a sign of a specific capability regression that needs to be addressed.
- Any metric degraded more than 15%: do not deploy the new version. Revert. Investigate. The model change broke something that needs to be fixed before the update ships.
- Tier 4 addition: Full re-certification for any significant model update. Restart shadow mode. Domain expert review of any behavioral changes. No Tier 4 system deploys a model update without a human expert confirming that the behavior changes are safe in the domain.
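The decision step can be sketched as a threshold comparison over the metric baseline. The 5% and 15% bands follow the protocol above; the metric names are illustrative:

```python
# Model change protocol decision sketch: compare every metric against
# the last certified baseline and return the worst-case verdict.
def change_decision(baseline: dict[str, float], current: dict[str, float]) -> str:
    worst = 0.0
    for name, base in baseline.items():
        degradation = (base - current[name]) / base  # relative drop
        worst = max(worst, degradation)
    if worst > 0.15:
        return "revert"        # do not deploy the new model version
    if worst > 0.05:
        return "investigate"   # degradation flagged for human review
    return "update_baseline"   # no meaningful regression

baseline = {"accuracy": 0.95, "citation_rate": 0.99}
after_update = {"accuracy": 0.89, "citation_rate": 0.99}
assert change_decision(baseline, after_update) == "investigate"  # ~6.3% drop
```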
The 5% threshold is calibrated conservatively. In a ten-step pipeline at 95% per step (after harness and evaluation), a five-point degradation in one step reduces end-to-end reliability from about 60% to about 57% — a measurable, significant change. The threshold exists because small model changes can produce compounding effects in multi-step pipelines that look minor in isolation and significant in aggregate.
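That compounding claim can be verified directly, treating step reliabilities as independent:

```python
# Compounding arithmetic behind the threshold: ten steps at 95% each,
# then the same pipeline with one step degraded by five points.
healthy = 0.95 ** 10
degraded = (0.95 ** 9) * 0.90
assert round(healthy, 2) == 0.60   # ~60% end-to-end reliability
assert round(degraded, 2) == 0.57  # one small regression, visible in aggregate
```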
The Capability Horizon
MAINTAIN has a dimension that AOME (Agent-Oriented Management of Engineering) calls Capability Horizon: AI agent capability is doubling roughly every seven months.1 Systems that were at the edge of what agents could reliably do when you built them will, eighteen months later, be well within what agents handle routinely.
This creates a specific maintenance responsibility: periodic re-evaluation of where human oversight is required. A Tier 4 system that required 100% human review of every output at launch may, two years later, be reliably accurate enough that the review burden could be reduced — not because the tier changed (the domain consequences are the same) but because the capability ceiling rose. The medication guidance system that required pharmacist verification on every response might, as models improve, produce outputs that are accurate enough to move to sampling-based verification rather than universal verification.
Capability Horizon monitoring is the structured practice of revisiting these thresholds. It asks: given what models can do today, are our oversight requirements still correctly calibrated? Or are we maintaining Tier 4 overhead on a system that now performs at Tier 3 reliability levels?
The protocol is conservative but deliberate:
1. Baseline re-run: Annually, re-run the complete evaluation suite against the current model version (which may have improved since launch).
2. Compare to tier thresholds: Where is the current performance relative to the certification thresholds? A system certified at 95% accuracy that is now performing at 99% has headroom.
3. Propose threshold adjustment: For systems with significant headroom, propose a reduction in oversight intensity. Not a tier change — the domain consequences are the same. But a change in the oversight mechanisms: from 100% human review to 25% sampling, or from supervised to auto-with-logging.
4. Domain expert sign-off: Any reduction in oversight for Tier 4 systems requires domain expert review. The expert confirms that the improvement is real, that it covers the cases that matter (not just the common cases), and that the proposed oversight reduction is safe.
5. Shadow the reduction before implementing it: Run the reduced-oversight configuration in shadow mode alongside the current configuration before switching. If the shadow configuration produces outputs that would have been flagged under the current configuration, reconsider.
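The headroom check in steps 2 and 3 can be sketched as follows. The 3-point margin and the function name are assumptions, and note that the result is only a proposal, never an automatic change:

```python
# Capability Horizon headroom sketch: compare current performance to
# the certification threshold and propose a reduced oversight level.
def propose_oversight(cert_threshold: float, current: float,
                      margin: float = 0.03) -> str:
    headroom = current - cert_threshold
    if headroom >= margin:
        # Proposal only: Tier 4 reductions still require domain expert
        # sign-off and a shadow run of the reduced configuration.
        return "propose_reduced_oversight"
    return "keep_current_oversight"

assert propose_oversight(cert_threshold=0.95, current=0.99) == "propose_reduced_oversight"
assert propose_oversight(cert_threshold=0.95, current=0.96) == "keep_current_oversight"
```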
Capability Horizon monitoring is the practice that prevents MAINTAIN from becoming a permanent, ever-increasing overhead. As models improve, systems can operate more reliably with less intervention. The MAINTAIN flywheel should become less expensive over time, not more expensive — as long as the Capability Horizon is actively monitored and oversight thresholds are adjusted when the evidence supports it.
What MAINTAIN Connects To
MAINTAIN is not the end of the pipeline. It's the loop that connects back to the beginning.
When the flywheel identifies a pattern of failures, the investigation might reveal a spec gap — a behavior the system should have but didn't because the spec didn't describe it. The fix goes back to SPEC: the spec is updated, the test scenarios are updated, the BUILD phase produces an update, and the update goes through CERTIFY before it reaches production.
When drift detection identifies that the system is optimizing for the wrong thing — that the intent drift indicators are moving — the fix goes back to the intent contract. The cascade of specificity is revisited. The trade-offs are re-examined. The updated intent contract goes through the same review process as the original.
When the model change protocol identifies a regression, the fix might be a harness update — a new deterministic validation rule that catches the failure mode the new model introduced. Or it might be a spec update that disambiguates an instruction the old model interpreted correctly and the new model interprets differently.
Every loop through MAINTAIN produces either: confidence that the system is working correctly (the inputs come back with no failures), or specific improvements to the spec, harness, intent contract, or evaluation library. The system gets more reliable with time — not because AI improves on its own, but because the human-managed infrastructure around it improves with every cycle.
This is the flywheel: each cycle of observation, evaluation, and correction improves the infrastructure. The infrastructure improvement reduces the failure rate. The lower failure rate means fewer cycles end in corrections. The system becomes lower-maintenance over time because its foundation is more solid.
MAINTAIN is the phase that makes long-term reliability possible. Teams that skip it are borrowing against a debt that compounds silently, then surfaces as an incident that requires emergency intervention. Teams that run it consistently build systems that become more reliable with age.
The Last Mile Is the Longest
Certification, deployment, and maintenance together constitute "the last mile" of the pipeline — the work that happens after the system is built but before (and long after) it's serving users.
The framing of "last mile" is borrowed from logistics, where the final stage of delivery — the last mile from distribution center to customer — is consistently the most expensive and most operationally complex part of the supply chain. The last mile in AI agent deployment is similar: it's not technically the hardest (that's the spec and the harness), but it's operationally the most demanding and the most underinvested.
The reason is temporal. BUILD and TEST happen once, with clear deliverables and an end date. CERTIFY, DEPLOY, and MAINTAIN happen once and then continuously, with no end date, no clear stopping criteria, and no moment of completion. The people who built the system move on to the next project. The system continues to run. The flywheel needs someone to keep it turning.
Organizations serious about AI reliability invest in the last mile explicitly. They staff MAINTAIN as a function, not a responsibility that falls on the original build team. They budget for quarterly intent reviews. They include model change protocol time in project planning. They treat certification as a real gate — not a formality before the launch announcement.
Organizations that treat the last mile as overhead — something to minimize and eventually hand off — find themselves rebuilding systems that degraded because nobody was maintaining them. The rebuild costs more than the maintenance would have. The incidents that prompted the rebuild cost more still.
The pipeline doesn't end at deployment. It loops. The methodology's eight phases are not a waterfall. They're a cycle. MAINTAIN feeds back into SPEC. SPEC feeds into BUILD. Every loop produces a more reliable system, as long as someone is turning the flywheel.
Part II is done. Part III goes under the pipeline — the harness, the intent layer, and the simplicity that holds the whole stack together.
Footnotes
1. METR (Model Evaluation & Threat Research), capability doubling findings. The ~7-month AI capability doubling figure is derived from METR's longitudinal benchmark tracking of autonomous task completion. [VERIFY — confirm specific METR report, publication date, and exact doubling interval figure]