Appendix C

AOME Metrics Quick Reference

Before writing this appendix I want to acknowledge something: AOME is not a published, peer-reviewed framework. It's a working name for a set of metrics I've been using across VZYN Labs, SonIA, Edifica, and the Regasificadora project to answer a question that DORA can't: when the developer is running a fleet of agents, what does "productivity" even mean? The framework draws on Grove's High Output Management (1983), DX Core 4, the METR productivity research, and the Mount Sinai factorial study. The metrics below are what I measure. The baselines are what I've seen in practice on my own projects. Your numbers will differ.

With that said, the gap these metrics fill is real. Teams adopting AI tools report 10–15% productivity gains on paper, yet the METR RCT found them completing complex real-world tasks 19% slower. DORA shows deployment frequency rising and lead time rising at the same time. Something is being measured wrong. AOME is an attempt to measure the right thing.

Why DORA Cracks

DORA measures deployment frequency, lead time, change failure rate, mean time to recovery, and reliability. All five are system-level outputs of a software delivery pipeline. They were built for a world where the unit of production was "engineer writes code → code moves through pipeline → deployment happens." In that world, increasing velocity was the goal and DORA told you whether you were getting there safely.

In the agent era, every one of those metrics is distortable by AI without quality improving. Agents can push 10× more PRs, inflating deployment frequency. Review time goes up because humans now have to verify what agents produced, inflating lead time. Change failure rate dips in the short term because simple changes succeed more often, then spikes later when cumulative spec drift surfaces. DORA still detects the shape of the pipeline; it no longer explains why the shape is what it is.

SPACE is more durable because it was always a model rather than a metric set. Its dimensions — satisfaction, performance, activity, communication, efficiency — still apply. But SPACE treats the human as the subject of measurement. In the agent era, the subject is the orchestration layer.

DevEx is the most prescient because it recognized cognitive load as a first-class variable. AI tools don't reduce cognitive load; they shift it. Implementation effort down, supervision and review effort up. DevEx measures the right axis but needs new vocabulary for the specific kind of load agents introduce.

AOME doesn't replace any of these. It sits on top. You still track DORA for pipeline health. You still track SPACE for team health. You add AOME to track the thing that's actually driving both.

The Five Dimensions

| Dimension | What It Measures | Grove Equivalent | Starter Metric |
| --- | --- | --- | --- |
| Fleet Output | What the agent fleet produces | Output of the factory | Features passing behavioral scenarios per sprint |
| Orchestration Quality | How effectively the team directs agents | Manager's leverage | Spec completeness score (8/8 sections, ambiguity count) |
| Capability Horizon | What agents can do now vs. six months ago | Training investment | New task categories successfully delegated to agents |
| Escalation Health | Whether failures surface cleanly | Quality indicators | % of escalations resolved vs. % buried in output |
| Context Integrity | Whether specs and docs stay current | Information flow | Spec-to-code drift score (days since last sync) |

Fleet Output

The output of the team, not the individual. Grove: a manager's output equals the output of their team. In the agent era, this is literal — the developer's output is what the fleet ships.

  • Primary metric: Features passing behavioral scenarios per sprint. Not PRs, not commits, not lines of code. A feature counts when it passes its specified scenarios. A PR that ships code but breaks scenarios is negative output.
  • Secondary metric: Spec-backed output ratio. Percentage of shipped features traceable to a specification section. Features shipped without a spec are liabilities, not output.
  • Baseline: A one-developer Dark Factory project with a mature harness, on my own work, runs 8–15 spec-backed features per two-week sprint on Tier 2 systems. Tier 4 drops to 2–4. Your baseline should be established over three sprints before you change anything.
  • How to measure: Tag every shipped feature with the spec section it implements. Run the eval suite at sprint boundary. Count passes.
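The sprint-boundary count can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the `FeatureResult` record and its fields are hypothetical stand-ins for whatever your eval suite and feature tagging actually emit.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FeatureResult:
    name: str
    spec_section: Optional[str]  # None = shipped without a spec (a liability)
    scenarios_passed: bool       # did the full behavioral scenario suite pass?

def fleet_output(results: list) -> dict:
    """Compute the two Fleet Output numbers at sprint boundary."""
    shipped = len(results)
    # Primary metric: features passing their specified scenarios.
    passing = sum(1 for r in results if r.scenarios_passed)
    # Secondary metric: share of shipped features traceable to a spec section.
    spec_backed = sum(1 for r in results if r.spec_section is not None)
    return {
        "features_passing": passing,
        "spec_backed_ratio": spec_backed / shipped if shipped else 0.0,
    }
```

The point of keeping it this small is that the number is defensible: every feature either maps to a spec section and a passing scenario suite, or it doesn't.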

Orchestration Quality

How well the human directs the fleet. This is the metric that replaces individual developer velocity.

  • Primary metric: Spec completeness score. The harness runs the 8-section check and the ambiguity-word scan (should, ideally, try to, usually, when possible) on every spec before BUILD. Score is 8/8 sections present, zero ambiguity words. Anything less is a drag on orchestration quality.
  • Secondary metric: Eval pass rate on first generation. The percentage of agent runs that pass behavioral scenarios without a rework cycle. Low pass rates indicate spec ambiguity more often than model failure.
  • Tertiary metric: Harness-block rate. The fraction of agent runs that the deterministic harness catches and returns for rework before a human sees them. A healthy harness-block rate is non-zero (the harness is doing its job) but trending down (specs and prompts are improving).
  • Baseline: On my projects, first-generation eval pass rate sits around 70–85% on Tier 2 systems with mature specs. Below 50% consistently means the spec is ambiguous and you're paying for it in review time.
  • How to measure: Instrument the pipeline. Every agent run has a spec input, a scenario suite, a result. Log the three numbers per run.
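The spec completeness check is mechanical enough to sketch. The ambiguity-word list is taken from the text above; the eight section names are placeholders (this appendix doesn't enumerate your template's sections), and the check assumes markdown-style headings in spec files.

```python
import re

# Hypothetical section names; substitute your own 8-section template.
REQUIRED_SECTIONS = [
    "Purpose", "Scope", "Inputs", "Outputs",
    "Behavioral Scenarios", "Escalation Criteria", "Constraints", "Non-Goals",
]
AMBIGUITY_WORDS = ["should", "ideally", "try to", "usually", "when possible"]

def spec_completeness(spec_text: str) -> dict:
    """8-section check plus ambiguity-word scan, run before BUILD."""
    present = [s for s in REQUIRED_SECTIONS
               if re.search(rf"^#+\s*{re.escape(s)}", spec_text, re.M | re.I)]
    hits = [w for w in AMBIGUITY_WORDS
            if re.search(rf"\b{re.escape(w)}\b", spec_text, re.I)]
    return {
        "sections_present": len(present),
        "ambiguity_hits": hits,
        "passes": len(present) == len(REQUIRED_SECTIONS) and not hits,
    }
```

Wire this into the harness so a spec that scores below 8/8, or contains any ambiguity word, never reaches an agent.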

Capability Horizon

The maximum complexity of task that agents in your pipeline can reliably complete autonomously; this is the analogue of METR's "time horizon" metric. It is the slowest-moving dimension and the one that most reveals whether investment in the methodology is paying off.

  • Primary metric: New task categories successfully delegated in the last quarter. At the start of VZYN Labs, "run a full SEO audit" was a task I had to decompose and supervise. Three months later, it was a single playbook invocation. That transition is capability horizon expansion.
  • Secondary metric: Autonomy grant ratio. Percentage of task types running in "autonomous" mode vs. "supervised" vs. "shadow." As a task earns trust through scenarios passing and production samples clean, it progresses from shadow → supervised → autonomous. The progression is the metric.
  • Baseline: In a mature Dark Factory pipeline, expect 3–5 task types to move up the autonomy ladder per quarter. If nothing moves, either you're not investing in eval coverage or your specs aren't improving.
  • How to measure: Maintain an autonomy ledger per task type. Record the date each task moved between levels and the evidence that triggered the move (typically a certain number of clean runs over a certain window).
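An autonomy ledger needs almost no machinery: a current level per task type and an append-only history of moves with their evidence. A minimal sketch, with the shadow → supervised → autonomous ladder from above; the class and method names are illustrative.

```python
from datetime import date

LEVELS = ["shadow", "supervised", "autonomous"]

class AutonomyLedger:
    """Per-task-type autonomy level plus the evidence behind each move."""

    def __init__(self):
        self.levels: dict[str, int] = {}  # task type -> index into LEVELS
        self.history: list[tuple] = []    # (date, task, from_level, to_level, evidence)

    def register(self, task: str) -> None:
        # Every new task type starts in shadow mode.
        self.levels.setdefault(task, 0)

    def promote(self, task: str, when: date, evidence: str) -> None:
        i = self.levels[task]
        if i + 1 < len(LEVELS):  # no level above autonomous
            self.history.append((when, task, LEVELS[i], LEVELS[i + 1], evidence))
            self.levels[task] = i + 1

    def level(self, task: str) -> str:
        return LEVELS[self.levels[task]]
```

The quarterly Capability Horizon number falls out of `history`: count the moves whose date lands in the quarter.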

Escalation Health

Whether failures surface where humans can see them, or get buried in plausibly correct output.

  • Primary metric: Escalation surface rate. Of the failures a post-hoc audit identifies, what percentage did the agent surface at the time (flagging, requesting human review, declining to act) vs. what percentage it silently resolved wrong? This is the single most important AOME metric for Tier 3-4 systems.
  • Secondary metric: Blocker dwell time. When an agent spins in a failure loop, how long before a human notices? Measured from first failed run to first human intervention. Long dwell time is the Bainbridge irony of automation playing out in real time.
  • Tertiary metric: Override rate. How often humans override the agent's output at review. A very low override rate combined with rising post-deployment bug reports means reviewers are rubber-stamping. A very high override rate means the agent is guessing.
  • Baseline: For a Tier 4 system, escalation surface rate should be above 95% — the agent surfaces almost every real failure. For Tier 2, 80% is tolerable. Below that in any tier, the harness needs more explicit escalation criteria in the spec.
  • How to measure: Sample production outputs at the tier-appropriate rate (10% for Tier 2, 25% for Tier 3, full coverage for Tier 4). Human auditor classifies each as correct, incorrect-and-surfaced, or incorrect-and-buried. The third category is the failure.
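Once the auditor has labeled each sampled output with the three categories above, the surface rate is a one-liner. A minimal sketch; the label strings are hypothetical spellings of the three categories.

```python
from collections import Counter

def escalation_health(labels: list) -> dict:
    """Labels are one of: 'correct', 'incorrect_surfaced', 'incorrect_buried'."""
    c = Counter(labels)
    failures = c["incorrect_surfaced"] + c["incorrect_buried"]
    # Of the failures the audit found, what share did the agent surface itself?
    rate = c["incorrect_surfaced"] / failures if failures else 1.0
    return {"failures_found": failures, "surface_rate": rate}
```

Compare the resulting `surface_rate` against the tier threshold (95% for Tier 4, 80% for Tier 2); a zero-failure sample yields 1.0 by convention, which is why sample size matters.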

Context Integrity

Whether the specifications, discovery documents, and intent contracts stay current as the system evolves.

  • Primary metric: Spec-to-code drift score. Days since the spec was last updated against a codebase that has been updated since. Every commit to production code without a corresponding spec commit accrues drift. The metric is a running clock.
  • Secondary metric: Drift-detection lead time. When a drift does occur, how long before the harness or an audit catches it? Low lead time (hours) is healthy. High lead time (weeks) means spec and code have diverged and the next significant agent run will be working from a stale mental model.
  • Tertiary metric: Intent freshness. For Tier 3-4 systems with intent contracts, how recently was the intent contract reviewed against actual production decisions? Intent drift is slower and more insidious than spec drift.
  • Baseline: Hernan's Edifica drift incident (fifteen undocumented changes in a week, described in Chapter 7) was a drift score of around 10 days when caught. Target on active systems: under 3 days.
  • How to measure: Git-level. Timestamp spec file vs. timestamp of code files in the same domain. The delta is the drift. Automate it. The whole point is that humans forget.
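The git-level automation can be sketched with two small functions: one that asks git for the last commit touching a path, one that turns the two timestamps into a drift score. The repo layout (which spec file pairs with which code directory) is yours to supply.

```python
import subprocess
from datetime import datetime, timezone

def last_commit_time(path: str, repo: str = ".") -> datetime:
    """UTC timestamp of the most recent commit touching `path`."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "-1", "--format=%ct", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return datetime.fromtimestamp(int(out), tz=timezone.utc)

def drift_days(spec_time: datetime, code_time: datetime) -> float:
    """Days the code has moved ahead of its spec; 0 if the spec is current."""
    return max(0.0, (code_time - spec_time).total_seconds() / 86400)
```

Run it on a schedule and alert when any domain crosses the 3-day target; the clock, not a human's memory, is what keeps the metric honest.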

Relationship to Existing Frameworks

| Framework | Still Useful For | Breaks On |
| --- | --- | --- |
| DORA | Pipeline speed baseline | AI inflates speed metrics while quality degrades |
| SPACE | Team health dimensions | Activity metrics inflate with AI-generated output |
| DevEx | Cognitive load awareness | AI shifts cognitive load (creation → supervision), doesn't reduce it |
| AOME | Agent-era orchestration | — (built for current reality) |

Treat AOME as an overlay. DORA still tells you whether the pipeline is healthy. AOME tells you whether the orchestration producing the pipeline output is healthy. Both are necessary.

Getting Started

A starter dashboard for a team adopting AOME:

  1. Pick two dimensions to measure this month. Fleet Output and Context Integrity are the strongest starting pair — the first tells you what you're shipping, the second tells you whether the system is eroding underneath.
  2. Establish baselines before changing anything. Run three sprints with measurement only, no intervention. Without a baseline, every subsequent number is compared against your expectations rather than your history.
  3. Add dimensions as maturity grows. Orchestration Quality in month two. Escalation Health in month three (requires sampling infrastructure). Capability Horizon last, because it's the slowest-moving and most easily confused with noise.
  4. Never measure all five without first measuring one well. A metric that exists but isn't trusted is worse than no metric. Pick one, instrument it honestly, defend the number for a sprint, then add the next.

What This Can't Do

AOME doesn't tell you why a number moved. A drop in first-generation eval pass rate could be spec ambiguity, a model update, a new domain, or a reviewer raising the bar. The metrics give you signals, not diagnoses. Treat them the way a manager treats quarterly reports — they tell you where to look, not what to fix.

And AOME doesn't capture culture, motivation, or the judgment developers apply to decide what to build. Those live in SPACE and in the unquantifiable parts of engineering leadership. The agent era doesn't eliminate those; it just stops pretending that activity metrics substitute for them.

The five dimensions above are what I measure. The baselines above are what I've seen. Run it on your own work for three sprints and the numbers will tell you more than any framework write-up, including this one.