Part: The Shift

Chapter 03: DORA Is Cracking

A randomized controlled trial published in 2025 found that experienced open-source developers using AI tools completed tasks 19% slower than developers without AI tools.1

Read that again. Slower. Not faster. Not the same. Slower.

The study, conducted by METR (Model Evaluation & Threat Research) with rigorous RCT methodology — randomized assignment, experienced developers, real-world tasks — demolished the assumption that AI coding assistants automatically improve developer productivity. On complex tasks requiring deep codebase understanding, the overhead of managing AI output exceeded the time saved by generating it.

I didn't need the study to believe this — I understood it the first time I watched a developer spend thirty minutes debugging AI-generated code they could have written correctly in ten. But the study should alarm anyone still using DORA to measure a team in the AI era, because DORA wouldn't catch it. Deployment frequency goes up. Lead time goes down. The dashboard says: productivity up. Reality says: quality down, cognitive load up, and your best developers burning their hours reviewing agent output instead of solving hard problems.

DORA is cracking.


How the Industry Responded

The METR finding was not welcome news.

The dominant narrative in 2024 and 2025 was that AI coding assistants provided productivity gains in the range of 20-55%. Those numbers came primarily from internal studies conducted by companies with a commercial interest in the result. GitHub's internal study of GitHub Copilot found a 55% productivity increase.2 Google's study of its internal AI tools found similar figures. These studies were not methodologically fraudulent — but they had a selection bias problem: they measured tasks that AI was good at (writing new code from clear specifications), in environments optimized for AI assistance, run by teams who wanted the tools to succeed.

METR's study controlled for these factors. It used real-world maintenance tasks on established codebases — the kind of work that dominates a senior developer's week, not the new-feature work that AI handles well. It randomized assignment. It measured actual time-to-completion, not self-reported productivity estimates.

The result was 19% slower. On complex, real-world tasks, AI assistance was a net drag on the developers who were supposed to be its primary beneficiaries.

The industry responded in three ways. Some dismissed the finding as a methodological artifact. Some argued the sample size was too small (49 developers). Some quietly began shifting their productivity claims from "developers go faster" to "developers at any skill level can now contribute" — a different claim, less falsifiable, and considerably less interesting to the CFO who approved the Copilot Enterprise license.

Almost nobody asked the harder question: if AI assistance slows experienced developers on complex tasks, what is our productivity framework actually measuring?


Three Frameworks, One Blind Spot

For the past decade, developer productivity has been measured by three frameworks.

DORA (2014) measures the delivery pipeline: how often you deploy, how fast changes reach production, what percentage of deployments fail, and how quickly you recover. It's quantitative, rigorous, and the de facto standard for engineering leadership. DORA's power is its simplicity — four metrics that correlate with business outcomes. Elite performers deploy on demand, recover in under an hour, and maintain change failure rates below 5%. The framework doesn't tell you how to achieve that. It tells you whether you did.

SPACE (2021) broadened the lens to five dimensions: satisfaction, performance, activity, communication, and efficiency. It acknowledged that DORA's pipeline metrics missed the human side — burnout, collaboration, satisfaction. SPACE recognized that a team shipping fast but burning out was not an elite team. It was a team about to crater.

DevEx (2023) went deeper into the daily experience: feedback loop speed, cognitive load, and flow state. It's the most prescient of the three — cognitive load and flow are exactly the dimensions AI disrupts most. DevEx was trying to measure the lived experience of being a developer, not just the output artifacts.

All three were designed for a world where humans write code and machines run it. That world is ending.


What Breaks

DORA breaks on speed. When AI generates code, deployment frequency and lead time improve — you ship more, faster. But the Uplevel study of 800 developers found AI assistants produced 41% more bugs with negligible speed gain.3 The GitClear analysis showed code churn doubled, refactoring dropped from 25% to below 10%, and duplicate code increased eightfold.4 DORA sees faster deploys. Production sees more incidents.

The mechanism is straightforward: AI generates code quickly, and developers ship it quickly, and DORA records a deployment. What DORA doesn't record is that the code was generated without deep understanding of the system it was integrated into, that the tests were also generated by AI and test for the wrong things, and that the change failure rate — which should be climbing — is masked by rapid hotfixes that are themselves AI-generated and themselves poorly understood.

Elite DORA performers in the AI era may be elite at the wrong things.

SPACE breaks on activity. Activity metrics — PRs merged, commits per day, lines of code — inflate dramatically with AI assistance. The SonarSource survey found 42% of committed code is AI-assisted.5 A developer who used to merge three PRs per week now merges eight. SPACE sees high activity. The codebase sees bloat.

There's a subtler problem: SPACE's satisfaction dimension, which should catch problems that activity metrics miss, is paradoxically high among AI tool adopters — at least initially. People enjoy the novelty. The velocity feels good. The dashboard turns green. It takes months for the downstream effects — the accumulating technical debt, the test suite that passes but doesn't protect, the architectural decisions made by an agent that had no architectural intent — to surface in the satisfaction scores. By then, the team has committed to the pattern.

DevEx breaks on cognitive load. The UC Berkeley ethnographic study found that AI doesn't reduce cognitive load — it intensifies it.6 Developers now manage two cognitive tasks simultaneously: their own problem-solving and the supervision of AI output. The HBS/BCG study found 17.8% of engineers experience what researchers termed "AI brain fry"7 — cognitive overload specifically from managing AI tools. DevEx measures cognitive load as something to minimize. But in the AI era, cognitive load shifts rather than disappears — from creation to supervision.

Supervision is cognitively expensive in ways that creation is not. When you write code, you understand what you wrote. When you supervise AI-generated code, you must maintain a dual awareness: does this code do what I think it does, and does what I think it does actually match what the system needs? The second question requires continuous context-loading — understanding the system, the requirements, the constraints — that the AI bypassed when it generated the code.

DevEx measures how burdensome the tools are. It doesn't measure whether the supervision burden those tools create is sustainable.


The Signal in the Sentiment Data

The developer sentiment data tells a story that the productivity frameworks miss entirely.

The Stack Overflow 2025 survey found that only 29% of developers trust AI tools to generate correct code.8 That number is striking for two reasons. First, it means 71% of developers working with these tools are doing so without trusting their output — every output requires verification. Second, it hasn't changed much in two years of rapid capability improvement. Developers became more skeptical as the tools became more capable, because more capable tools produce more plausible-looking wrong answers.

20% of developers reported a loss of confidence in their own technical skills after adopting AI tools. This isn't false modesty — it reflects a real phenomenon. When you offload enough of the mechanical coding work to AI, you lose the feedback loops that build and maintain technical intuition. A senior developer who spends a year supervising AI output rather than writing code discovers that their mental model of the system has gaps. The AI was filling in the gaps they used to fill themselves.

17.8% report "AI brain fry" — cognitive overload specifically attributed to AI tool management. This is the DevEx signal, but it's the signal DevEx wasn't designed to detect: not that the tools are hard to use, but that using them correctly is exhausting in ways that using traditional tools was not.

The METR finding and the sentiment data together tell a coherent story: AI assistance is most beneficial on tasks it handles well (well-specified new code) and most costly on tasks where its output requires skilled supervision (complex existing systems). The productivity gains go to the work that matters less. The productivity costs concentrate in the work that matters most.

DORA, SPACE, and DevEx don't distinguish between these task types. They measure outputs, not fitness for purpose.


What the Survey Data Actually Says

The productivity studies — METR's RCT, the Uplevel bug-rate analysis, the GitClear code quality data — are the quantitative case. The developer survey data adds a qualitative layer that the metrics miss: how it actually feels to work this way.

The Stack Overflow 2025 developer survey asked developers about their trust in AI-generated code. The answer — 29% trust AI tools to generate correct code — has a particular texture when you read the follow-up questions. It's not that developers think AI is useless. They use it extensively. It's that they've internalized the verification burden. "Trust," in this context, means willingness to accept output without verifying it. And experienced developers have learned that such trust is not warranted — not because AI tools are bad, but because the failure modes are subtle enough that verification is always necessary.

That 71% verification rate has a cost. Every output that requires verification adds a context-switching overhead: load the output into working memory, reason about it from first principles, compare it against what the system needs, decide whether to accept, revise, or reject. That cognitive sequence is expensive in a way that writing code is not, because writing code is a productive cognitive state — you're building something — while verification is a critical cognitive state — you're finding problems. Humans are slower at finding problems than at building things, and the psychological cost is higher.

The 20% confidence loss finding is the most underappreciated number in the developer sentiment data. Developers who adopted AI tools early and used them extensively reported losing confidence in their own technical skills. Not all of them — but one in five. These aren't junior developers who never had confidence. The studies primarily survey experienced developers. These are people who spent years building technical skills, who offloaded the maintenance and exercise of those skills to AI tools, and who found those skills atrophying faster than they anticipated.

This is the developer version of the calculator dependency concern. A generation trained to calculate in their heads loses that ability when calculators do it for them. The concern about calculators turned out to be largely overstated — the skills you lose are the skills you no longer need. But the concern about AI coding tools may be differently founded, because the skills you lose — architectural judgment, deep system understanding, the ability to hold a complex codebase in working memory — are the skills that distinguish good engineers from average ones. They're also the skills that allow you to supervise AI output effectively. Lose them, and you lose the judgment that makes you a reliable orchestrator.

The 17.8% "AI brain fry" number captures something the other metrics miss: not that the tools are used too much, but that using them correctly is itself exhausting. The developers experiencing brain fry aren't the ones who use AI carelessly — they're the ones who use it carefully, who maintain dual awareness, who verify everything, who carry the cognitive burden of being a reliable human checkpoint in a system that produces unreliable outputs. The burden of careful AI use is higher than the burden of no AI use. The productivity gains, when they exist, must offset a cognitive tax that the frameworks don't measure.


The Paradigm Inversion

Here's what the frameworks miss: developers are becoming managers.

Not managers of people — managers of agent fleets. They assign tasks, review output, catch errors, give feedback, decide when to escalate. Those are management functions. Andy Grove described them in High Output Management:9 define the task, allocate resources, monitor execution, evaluate output, iterate.

But no productivity framework measures management quality. DORA measures pipeline speed. SPACE measures team health. DevEx measures individual experience. None of them measure: how effectively does this person orchestrate AI agents to produce reliable software?

A junior developer I spoke with — Joen, who learned to code inside a San Francisco company that mandated Claude usage eight months ago — said it cleanly: "Juniors implement AI harder than any senior." I realized he was right. The juniors have tool fluency. They think in modes (ask vs. agent vs. search). They carry no identity attachment to manual coding that slows adoption. But they also don't have the domain knowledge, the architectural judgment, or the experience to evaluate whether the output is correct.

The seniors have the judgment but resist the tools. The juniors have the tools but lack the judgment. Neither is measured by the frameworks we have.

Joen's situation is representative of a generation: technically capable, AI-fluent, underemployed. His skills don't map to the job descriptions written for a pre-AI world. The companies hiring are still looking for evidence of what Joen can do without AI. The companies that will win are the ones who figure out how to measure what he can do with it.


The Measurement Trap

Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.

DORA metrics were designed to track the outcomes of good engineering practices. Frequent deployments, short lead times, low change failure rates — these correlate with high-performing teams because high-performing teams build systems that make frequent, safe deployment easy. The correlation exists because the underlying practices create both the metric and the outcome.

AI breaks the correlation. AI tools allow teams to increase deployment frequency and decrease lead time without the underlying practices that make those metrics meaningful. A team can deploy AI-generated code twelve times per day with zero understanding of what they're deploying. DORA sees elite performance. The codebase accumulates debt at elite speed.

When an organization uses DORA to evaluate teams in the AI era, it isn't just measuring the wrong things — it's creating incentives to optimize the wrong things. Teams that want to score well on DORA will use AI to ship faster, because AI makes shipping faster easy. The metric designed to reward good engineering practice becomes a lever for producing the appearance of good engineering practice while the software quietly degrades underneath.

This is not a hypothetical. The GitClear analysis found that AI-assisted codebases showed exactly the pattern you'd expect from Goodhart optimization: high activity, accelerating churn, declining refactoring, increasing duplication. Teams shipping more, maintaining less. DORA green. Codebase red.

The measurement trap matters because organizations make hiring, investment, and promotion decisions based on these metrics. A team lead who uses AI to boost DORA numbers gets credit for high performance. A team lead who uses AI to improve code quality, reduce technical debt, and build reliable systems — work that doesn't show up in DORA — gets measured as average. The incentive structure selects for the wrong kind of AI adoption.


AOME: Agent-Oriented Management of Engineering

What's needed is a framework built for the world as it is — where developers manage agent fleets, not just codebases.

Five dimensions, adapted from Grove's management primitives:

Fleet Output — What does the agent fleet produce? Not lines of code, but working features that pass behavioral scenarios. Measure the output of the system, not the activity of the humans.

In practice, this means tracking the percentage of agent tasks that complete without human intervention, the rate at which completed tasks pass quality review, and the trend over time. A team whose fleet output improves month over month is building better agents. A team whose fleet output is flat is running the same agents on new problems. A team whose fleet output is declining has a harness or evaluation problem.
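As a sketch, the fleet-output ratios are simple to compute once the tracker records two facts per agent task. The per-task fields here ("intervened", "passed_review") are assumptions about what a team chooses to log, not a standard schema.

```python
# Minimal fleet-output sketch; field names are invented for illustration.
def fleet_output(tasks: list[dict]) -> dict:
    total = len(tasks)
    return {
        # share of tasks completed with no human intervention
        "autonomy_rate": sum(not t["intervened"] for t in tasks) / total,
        # share of completed tasks that passed quality review
        "pass_rate": sum(t["passed_review"] for t in tasks) / total,
    }
```

Tracked per sprint, the direction of these two numbers is the dimension: rising means the agents (or the specs feeding them) are improving; flat or falling points at a harness or evaluation problem.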

Orchestration Quality — How effectively does the team direct agents? This is the new core skill — specification quality, prompt precision, context management. A team that writes better specs gets better output from the same models.

Orchestration quality is measurable. How often does an agent complete a task on the first instruction versus requiring multiple correction rounds? How often does the spec change significantly during execution — indicating that the original spec was ambiguous? How often does the agent escalate to a human for a decision that should have been specified in advance? These ratios are signals of spec quality. Poor specs create poor orchestration. Good specs create efficient execution.
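The same kind of hypothetical task record can carry the orchestration signals. Again, the field names are assumptions about what a team logs, not an established schema.

```python
# Orchestration-quality signals over a sprint's agent tasks;
# field names are invented for illustration.
def orchestration_signals(tasks: list[dict]) -> dict:
    n = len(tasks)
    return {
        # completed on the first instruction, no correction rounds
        "first_pass_rate": sum(t["correction_rounds"] == 0 for t in tasks) / n,
        # spec changed significantly during execution: the original was ambiguous
        "spec_churn_rate": sum(t["spec_changed"] for t in tasks) / n,
        # escalated for a decision the spec should have settled in advance
        "avoidable_escalation_rate": sum(t["avoidable_escalation"] for t in tasks) / n,
    }
```

A high first-pass rate with low churn and few avoidable escalations is what a well-specified backlog looks like in the data.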

Capability Horizon — METR tracks this: AI agent capability is doubling roughly every seven months.10 What can your agents do today that they couldn't do six months ago? Are you expanding the boundary, or are you using last year's tools on last year's problems?

This dimension rewards investment in evaluation and spec infrastructure. Teams that build good eval frameworks can detect capability improvements as new models ship. Teams without eval infrastructure don't know when their agents get better — they just know the outputs are "still pretty good." Capability horizon is the compounding advantage: teams that expand it systematically pull away from teams that don't.

Escalation Health — When agents hit their limits, do they escalate cleanly? Or do failures get buried in output that looks correct but isn't? Healthy escalation means the human-in-the-loop touchpoints are working — the agent knows when to stop and the human knows what to check.

Escalation health inverts a common misconception about AI autonomy. Most teams optimize for agents that escalate less — less human intervention, more automation, lower overhead. That's the wrong optimization. The goal is agents that escalate correctly: that recognize the boundary of their reliable operation and hand off cleanly when they cross it. An agent that completes tasks autonomously 95% of the time and escalates confidently and accurately 5% of the time is far more valuable than an agent that completes tasks 99% of the time but buries the 1% that went wrong in plausible-looking output.
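Rough expected-cost arithmetic makes the trade-off concrete. The cost figures below are invented for illustration: suppose a clean escalation costs one unit of review time, while a silent failure that ships costs fifty units of incident and debugging work.

```python
# Invented illustrative costs, not measured values.
ESCALATION_COST, SILENT_FAILURE_COST = 1, 50

def cost_per_100_tasks(clean_escalation_rate: float, silent_failure_rate: float) -> float:
    return 100 * (clean_escalation_rate * ESCALATION_COST
                  + silent_failure_rate * SILENT_FAILURE_COST)

agent_a = cost_per_100_tasks(0.05, 0.00)  # 95% autonomous, escalates the other 5% cleanly
agent_b = cost_per_100_tasks(0.00, 0.01)  # 99% autonomous, buries the remaining 1%
```

Under these invented numbers, the agent that "completes" more tasks costs ten times as much, because the buried failures land in production.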

Context Integrity — Does the specification, discovery document, and intent contract remain accurate over time? Spec drift, stale documentation, and context rot are the silent killers of agent productivity. Context integrity measures whether the information the agents work from is current.

This dimension captures something no prior framework tracked: the quality of the information environment in which developers and agents work. In a traditional development team, documentation drift was a nuisance — humans could work around stale docs by asking colleagues. In an agent-led team, stale context is a failure mode. Agents work from what they're given. If what they're given is wrong, they produce wrong results confidently.

Context integrity requires treating specifications and discovery documents as living artifacts — versioned, reviewed, and updated when the system changes. Teams that neglect this find their agents becoming less reliable over time not because the agents degraded, but because the world the agents were trained on diverged from the world they're operating in.


AOME in Practice

What does it actually look like when a team adopts AOME thinking?

The conversations change first. Instead of sprint reviews focused on story points and velocity, the team asks: what percentage of our agent tasks completed cleanly this sprint? Where were the escalations, and what patterns do they reveal about spec gaps? What's the current fleet output ratio, and is it improving?

These questions sound abstract until you're running them weekly. In practice, they surface things that sprint velocity never caught: a spec that kept generating escalations at the same decision point (spec gap — the behavioral contract was missing a rule); an agent task that completed cleanly all sprint but produced outputs that failed downstream quality review (evaluation gap — the check was running but measuring the wrong dimension); a new model version that shipped mid-sprint and shifted fleet output by 8% without warning (capability horizon event — requires baseline audit, not just monitoring).

The hiring criteria change. A developer who can write precise behavioral specifications — who can look at a system and describe what it should do precisely enough that an agent can implement it without clarifying questions — is more valuable than a developer who can write the code themselves. This is an unfamiliar criterion. Most technical interviews are designed to assess implementation skill. AOME thinking asks for something different: the ability to encode intent with sufficient precision that a non-human executor can act on it reliably.

In practice, this means adding a specification exercise to the interview process. Give a candidate a system to analyze and ask them to write the behavioral contract for a specific agent task. Can they identify what the agent should do, what it should refuse, when it should escalate, and how success should be measured — without the code? The candidates who can are rare and valuable. The ones who can't but are strong implementers are not poorly skilled; they're skilled in a way that AI is increasingly capable of substituting for.

The investment priorities change. Teams shift from spending on more models to spending on better harness infrastructure. From adding capabilities to evaluating what they have. From building new agents to making existing agents reliable. The compound math from chapter 2 applies here: a team that invests in harness and evaluation gets compounding returns. Each percentage point of reliability improvement multiplies across every pipeline the harness serves.
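The compound math can be made concrete with per-step reliability. Assuming independent step failures (a simplification), end-to-end reliability is the product of the step reliabilities; the rates and step count below are invented for illustration.

```python
def pipeline_reliability(per_step: float, steps: int) -> float:
    # End-to-end success probability under independent step failures
    # (an illustrative simplification, not a measured model).
    return per_step ** steps

before = pipeline_reliability(0.97, 20)  # roughly 0.54 end-to-end
after = pipeline_reliability(0.98, 20)   # roughly 0.67 end-to-end
```

One point of per-step improvement becomes roughly twelve points of end-to-end reliability across twenty steps. That is the compounding return on harness and evaluation investment.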

What teams stop measuring matters as much as what they start measuring. Commit counts, PR volume, and velocity points measure activity, not value. In a world where AI can generate both good and bad code at high volume, activity metrics actively mislead. AOME teams stop rewarding activity and start rewarding outcomes: working software, reliable agents, accurate specs.


The Career Measurement Gap

AOME has a career consequence that gets overlooked in the metrics discussion.

The skills AOME rewards are not the skills the job market has learned to identify. Technical interviews are designed around implementation: write this function, debug this code, explain this algorithm. Those skills remain valuable, but they're no longer the limiting factor. The limiting factor is orchestration quality — the ability to write a behavioral specification precise enough that an agent can implement it without asking clarifying questions.

This is not a skill that shows up in a LeetCode score. It doesn't appear on a GitHub profile — at least not in any form that current automated screening tools recognize. A developer with extraordinary specification skill who orchestrates agents to build complex, reliable systems might have fewer commits than a developer who writes everything manually. Their commit history tells the wrong story. Their test coverage might be identical — if they built the evaluation layer well, the tests are there; they just weren't all written by hand.

The career measurement gap creates a specific injustice for developers who adapted early and adapted well. Joen is the case in point: he learned agent-oriented development inside a company that mandated it, built real orchestration skill, and now faces a hiring market asking him to demonstrate implementation skill he's intentionally offloaded to agents. He's skilled at the thing that will matter in five years, and undervalued by the tools that measure the thing that mattered five years ago.

Engineering leaders making hiring decisions in the AI era should be asking different questions: Can this person write a specification that an agent can execute? Can they design an evaluation suite that catches what the agent misses? Do they understand the compound failure math well enough to know when a system needs a harness? These questions don't have standardized assessments. They require the kind of judgment-based interview that most organizations have moved away from in favor of algorithmic screening.

The organizations that figure out how to identify and retain orchestration talent will build more reliable systems faster. The ones that keep hiring for implementation skill and measuring with activity metrics will build faster and worse — and won't know it until the DORA dashboard turns red.


DORA's Useful Remnant

None of this means DORA is useless.

DORA's pipeline metrics still measure what they always measured: how fast working software gets from commit to production, how often it fails, and how quickly you recover. Those are still worth tracking. A team with poor DORA metrics has a delivery problem. A team with elite DORA metrics in the AI era might have a quality problem, a spec problem, or a context integrity problem — none of which DORA catches.

AOME doesn't replace DORA. It complements it. DORA tells you how the pipeline runs. AOME tells you whether the people operating the pipeline are doing so effectively in a world where the pipeline runs on AI.

A practical hybrid looks like this: keep DORA's four metrics as pipeline health indicators, and layer AOME's five dimensions as team capability indicators. When DORA turns red, investigate the pipeline. When AOME's fleet output declines while DORA stays green, investigate the specs. When orchestration quality drops while fleet output holds, investigate whether the team's spec-writing skill is keeping pace with the complexity of what they're building.
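That triage order can be written down as a decision rule. The thresholds, parameter names, and trend convention (positive means improving) are invented for illustration.

```python
def triage(dora_green: bool, fleet_output_trend: float, orchestration_trend: float) -> str:
    # Invented convention: trend > 0 improving, trend < 0 declining.
    if not dora_green:
        return "investigate the pipeline"
    if fleet_output_trend < 0:
        return "investigate the specs"
    if orchestration_trend < 0:
        return "investigate spec-writing skill vs. system complexity"
    return "healthy"
```

The ordering matters: pipeline health is checked first because a red DORA signal is still the cheapest failure to localize.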

The combination answers the question that neither answers alone: is this team building reliable software with AI assistance, at a sustainable pace, in a way that will be defensible when the next audit happens, the next incident occurs, or the next model version ships?


The Broader Pattern

The three frameworks — DORA, SPACE, DevEx — didn't fail because they were poorly designed. They failed because the world they were designed for changed faster than measurement frameworks typically evolve.

Measurement frameworks encode assumptions about how value is created. DORA assumes value is created by shipping working software frequently and recovering from failures quickly. SPACE assumes value is created by teams of satisfied, high-performing, well-coordinated individuals. DevEx assumes value is created by developers with low cognitive load and high flow state.

All of these assumptions were reasonable for teams of humans writing code. None of them adequately model teams of humans orchestrating agents to write code. The assumptions about where the cognitive work happens, where the quality decisions live, where the failures originate — all of these shift when the primary production unit changes from "a developer writing a function" to "a developer specifying a behavior and an agent implementing it."

What's needed is not just new metrics but a new mental model: the developer as manager of production capacity, not as direct producer of artifacts. This mental model makes AOME's dimensions intuitive — of course you'd measure fleet output, orchestration quality, and escalation health if you understood your job as managing a fleet of agents rather than writing code. The metrics follow from the model.

The organizations that make this mental shift fastest — that stop measuring developers as code writers and start measuring them as orchestrators — will build the institutional capability to use AI reliably at scale. The ones that don't will keep shipping faster and worse, DORA dashboards glowing green all the way down.

The frameworks we inherited weren't designed to answer the question that now matters. The next one starts where this one lands: once you accept developers are orchestrating agents, trust stops being a posture and becomes an architectural decision — tiered, enforced, and scaled into the stack.


Footnotes

  1. METR (Model Evaluation & Threat Research), "Measuring the Impact of Early AI Coding Tools on Experienced Open-Source Developer Productivity" (2025). Available at metr.org. Randomized controlled trial: 49 experienced open-source developers, real-world maintenance tasks on established codebases, randomized assignment to AI-assisted and unassisted conditions, time-to-completion measured objectively.

  2. Sida Peng et al., "The Impact of AI on Developer Productivity: Evidence from GitHub Copilot," arXiv:2302.06590 (2023). The commonly cited 55% figure refers to task completion speed on a controlled coding task. Note: figures from vendor-sponsored studies reflect conditions favorable to AI assistance (new-feature writing on isolated tasks).

  3. Uplevel, "AI Coding Tools Study: Developer Productivity and Code Quality" (2024). Analysis of 800 developers across enterprise codebases. [VERIFY exact title and publication URL at uplevelteam.com]

  4. GitClear, "Coding on Copilot: 2024 Data Suggests Downward Pressure on Code Quality" (2024). Analysis of code churn, refactoring rates, and copy-paste duplication across AI-assisted codebases. Available at gitclear.com. [VERIFY URL]

  5. SonarSource, "State of Code 2024 Report." [VERIFY exact title and URL at sonarqube.com or sonarcloud.io]

  6. [SOURCE — UC Berkeley ethnographic study on developer cognitive load under AI assistance; citation needed. Confirm study title, authors, and publication year.]

  7. The "AI brain fry" statistic (17.8%) is attributed to research on knowledge worker cognitive load under AI tool use. Likely sourced from: Fabrizio Dell'Acqua et al., "Navigating the Jagged Technological Frontier," Harvard Business School Working Paper 24-013 (September 2023), or a related follow-on study. [VERIFY — confirm the 17.8% figure against this specific paper; the Dell'Acqua et al. study is real but the exact statistic requires checking]

  8. Stack Overflow, "2025 Developer Survey." Annual survey of approximately 65,000 developers worldwide. Available at survey.stackoverflow.co.

  9. Andy Grove, High Output Management (Random House, 1983). Grove's framework for measuring managerial output — defining tasks, allocating resources, monitoring execution, evaluating results — is the source for AOME's management primitive mapping.

  10. METR capability research published via metr.org. The ~7-month doubling estimate refers to METR's ongoing autonomous task evaluations tracking AI capability improvements over time. [VERIFY latest published estimate; this figure evolves as new models are evaluated]