Vendor case studies are useful data points. They are not evidence. When OpenAI reports that Codex users inside OpenAI merge 60% more PRs per week, you should note the context: 95% of OpenAI developers use Codex internally, which may reflect organisational expectation as much as organic productivity gain.
For evidence, you need independent research. Randomised controlled trials. Longitudinal studies. Analysis of large code repositories with no commercial stake in the outcome. Three such efforts have now published findings, and they are worth reading carefully.
DORA: speed up, stability down
Google’s DORA (DevOps Research and Assessment) programme is the industry standard for measuring software delivery performance. It has studied roughly 5,000 professionals over two years, specifically examining what AI adoption does to the four DORA metrics.
The 2024 findings: for every 25% increase in AI adoption, delivery stability dropped 7.2%. Delivery throughput fell 1.5%. Time spent on useful work decreased 2.6%. Code quality improved 3.4% and documentation quality improved 7.5%, both genuine positives. But the stability finding is the one that matters most for engineering organisations thinking about risk.
DORA’s 2025 study found the stability problem persisting. Delivery throughput had recovered to positive territory, which is good. Developer sentiment, though, dropped from 72% to 60%. That is a significant slide in how developers feel about their own work experience with these tools.
The mechanism is straightforward. AI enables developers to produce larger changesets faster. Bigger batches carry higher failure risk, something DORA has documented for a decade independent of AI. Teams are also reporting over-reliance on AI during code review: reviews speed up, but defects get missed. The intuitive hypothesis, that teams would “fail fast, fix fast” and net out ahead, does not appear in the data. Instability drives burnout and harms product quality, and DORA’s longitudinal data does not show recovery.
METR: experienced developers, going slower
The METR randomised controlled trial is the study that should get the most attention, because it was designed to control for the variables that make self-reported productivity surveys unreliable.
METR recruited experienced open-source developers, people who know their codebases well, and randomly assigned them to complete tasks with and without AI assistance. The result: developers with AI tools were 19% slower than those without, on tasks in codebases they were already familiar with.
Those same developers believed they were 20% faster.
The gap between perceived and actual productivity matters. Developers are confident that AI is helping them. On familiar tasks in familiar codebases, the evidence says that confidence is misplaced. The overhead of prompting, reviewing, and correcting AI output, what the DX research calls “a new type of interruption,” appears to outweigh the time saved. At least for experienced engineers working in territory they already know.
This does not mean AI is useless for senior developers. It means the productivity case is much more context-dependent than the marketing claims suggest.
GitClear: the code quality signal
GitClear analysed 211 million changed lines of code across five years (2020-2024), tracking quality metrics as AI adoption rose. The findings are uncomfortable.
Code churn, code that gets revised within two weeks of being committed, rose from 3.1% in 2020 to 5.7% in 2024. That is code that was written, merged, and then immediately changed. Duplicated code blocks grew at 4x the rate of prior years, with copy/paste lines rising from 8.3% to 12.3% of the total. Most striking: the share of refactored (moved or restructured) lines dropped from 24.1% in 2020 to 9.5% in 2024. A 60% decline.
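GitClear's exact methodology is proprietary, but the churn metric itself is simple to state: the share of changed lines that are revised again within two weeks of being committed. A minimal sketch, with invented dates purely for illustration:

```python
from datetime import date, timedelta

def churn_rate(lines, window_days=14):
    """Fraction of changed lines revised within `window_days` of being
    committed -- the 'code churn' definition used above."""
    churned = sum(
        1 for authored, revised in lines
        if revised is not None and revised - authored <= timedelta(days=window_days)
    )
    return churned / len(lines)

# Toy history: (date a line was committed, date it was next revised, or None)
history = [
    (date(2024, 1, 1), date(2024, 1, 5)),   # revised after 4 days  -> churn
    (date(2024, 1, 1), date(2024, 3, 1)),   # revised after 60 days -> stable
    (date(2024, 1, 1), None),               # never revised         -> stable
    (date(2024, 1, 2), date(2024, 1, 10)),  # revised after 8 days  -> churn
]
print(f"{churn_rate(history):.0%}")  # → 50%
```

At industry scale the same calculation moved from 3.1% to 5.7%, which is the figure quoted above.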
GitClear’s interpretation, which aligns with the DORA data: AI makes it easy to generate code, but the harder work of restructuring and improving existing code, the work that pays down technical debt, is declining. More code is being added. Less is being cleaned up. The result is a codebase that grows faster and degrades faster.
What these three studies have in common
All three point at the same thing: AI accelerates code production but has not yet been shown to improve the quality or stability of software delivery at scale.
That is not an argument against using AI. It is an argument about what the current tools are actually optimising for. Throughput is not the same as productivity. Teams that treat it as such and ship faster without addressing quality and stability costs are building up a bill that the DORA data says will eventually come due.
Sources: Google DORA 2024 Report, DORA 2025 State of AI-Assisted Software Development, METR Randomised Controlled Trial (2025), GitClear AI Copilot Code Quality Research, 211M lines of code (2020-2024).
Read these three studies together and the honest headline writes itself: AI is making software development less stable and degrading code quality at scale. The individual findings are bad enough. The pattern across all three is worse.
DORA: the stability finding is damning
A 7.2% drop in delivery stability per 25% increase in AI adoption. That finding persisted across two annual reports, covering roughly 5,000 professionals. This is the most robust longitudinal finding in the AI productivity literature, and it says that AI adoption directly correlates with more production failures.
The throughput recovery in the 2025 report is not the good news people want it to be. Throughput measures how fast you ship. Stability measures whether what you ship works. Recovering throughput while stability stays degraded means you are shipping broken software faster. That is not progress.
Developer satisfaction dropping from 72% to 60% in a single year is what burnout looks like before it shows up in attrition numbers. These are not developers complaining about a new tool. These are developers reporting that their daily work experience has gotten worse. When roughly one developer in eight shifts from satisfied to dissatisfied in twelve months, you have a retention problem forming, and it will not show up in your headcount data until it is too late to reverse.
The 3.4% code quality improvement and 7.5% documentation quality improvement are real. They are also small consolation when your production failure rate is climbing. A well-documented system that fails more often is still a system that fails more often.
METR: the most carefully controlled finding contradicts the narrative
The METR randomised controlled trial is the gold standard in this literature. Experienced open-source developers. Familiar codebases. Random assignment to AI-assisted and unassisted conditions. Objective measurement of task completion time.
The result: 19% slower with AI. Not faster. Slower.
And here is the part that should concern every engineering leader: those same developers believed they were 20% faster. The gap between perceived and actual productivity is nearly 40 percentage points. Developers are confidently wrong about whether AI is helping them.
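The size of that gap is simple arithmetic on the two figures above:

```python
# Perceived vs measured speed change in the METR trial, from the figures above.
perceived = 0.20   # developers believed they were 20% faster with AI
measured = -0.19   # measured task completion was 19% slower with AI
gap = round((perceived - measured) * 100)  # gap in percentage points
print(f"Perception gap: {gap} percentage points")  # → Perception gap: 39 percentage points
```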
This finding has direct implications for every organisation relying on developer self-reports to justify AI investment. If experienced developers consistently overestimate AI’s benefit by 40 points, then the satisfaction surveys and adoption metrics that most companies use to measure “AI success” are measuring confidence, not competence. You cannot trust the people using the tool to tell you whether the tool works. You need objective measurement, and most organisations do not have it.
GitClear: technical debt accumulating in production
GitClear’s dataset covers 211 million changed lines of code over five years. The trends are unambiguous.
Code churn nearly doubled, from 3.1% to 5.7%. This is code that gets written, merged, and revised within two weeks. It is the clearest signal we have that AI-generated code is not surviving contact with production.
Duplicated code blocks grew at 4x the rate of prior years. Copy/paste lines rose from 8.3% to 12.3%. AI coding assistants are generating code that already exists elsewhere in the codebase, and developers are not catching it. Each duplicated block is a maintenance liability and a future inconsistency.
The refactoring decline is the most alarming number in the entire dataset. The share of refactored lines dropped from 24.1% to 9.5%, a 60% decline. Refactoring is how codebases stay healthy. It is how you pay down technical debt, consolidate abstractions, and keep complexity manageable. That work is vanishing from the commit record, replaced by net-new code that is cheaper to generate but more expensive to maintain.
The maths is straightforward. More code generated + less code refactored + higher churn rate = accelerating technical debt. The teams living with these codebases in two years will pay the price of today’s AI-generated shortcuts.
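One way to make that arithmetic concrete is to put the two GitClear trends side by side. The ratio below is my framing, not GitClear's, but both inputs are the figures quoted above:

```python
# GitClear figures quoted above, as percentages of changed lines.
refactor = {2020: 24.1, 2024: 9.5}   # moved or restructured lines
churn    = {2020: 3.1,  2024: 5.7}   # lines revised within two weeks

for year in (2020, 2024):
    ratio = refactor[year] / churn[year]
    print(f"{year}: {ratio:.1f} refactored lines per churned line")
# → 2020: 7.8 refactored lines per churned line
# → 2024: 1.7 refactored lines per churned line
```

In 2020, cleanup work outpaced churn nearly eight to one; by 2024 the margin had collapsed to less than two to one.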
The pattern across all three
DORA says stability is degrading. METR says experienced developers are slower. GitClear says code quality metrics are deteriorating across the industry.
The three studies agree on the mechanism: AI speeds up code generation and degrades everything downstream. Review quality drops. Refactoring declines. Failure rates climb. Developer satisfaction falls.
The productivity narrative depends on equating “more code produced” with “more value delivered.” These three studies, taken together, say that equation does not hold.
The positive signals in these three studies are getting buried under cautionary headlines. That is a mistake. Read the data carefully and the trajectory is clear: AI is improving the fundamentals of software delivery, and the early instability is exactly what you would expect from any major tooling shift.
DORA: the recovery matters more than the dip
Yes, delivery stability dropped 7.2% per 25% increase in AI adoption in DORA’s 2024 report. That is a real finding. But look at what happened next. By the 2025 study, delivery throughput had recovered from negative to positive territory. Code quality was up 3.4%. Documentation quality was up 7.5%. Those are durable improvements measured across roughly 5,000 professionals.
The stability dip follows a pattern DORA has documented before, well before AI entered the picture. Any time teams increase batch sizes and change velocity, failure rates temporarily spike. This happened with CI/CD adoption. It happened with microservices migrations. The question is not whether there is a dip. The question is whether teams learn to stabilise at the new speed. DORA’s own throughput recovery suggests they are starting to.
The developer sentiment drop from 72% to 60% deserves context too. Developers in the middle of a tooling transition always report lower satisfaction. That number will be worth watching in the 2026 report, but treating it as a permanent verdict on AI rather than a signal of transition friction misreads the data.
METR: a narrow finding, applied too broadly
The METR randomised controlled trial found experienced developers were 19% slower with AI tools on tasks in codebases they already knew well. That is the most carefully controlled finding in this space, and it deserves respect. But the scope of the finding matters as much as the result.
METR tested one specific scenario: experienced developers, familiar codebases, routine tasks. This is precisely the context where AI offers the least marginal benefit. The developer already knows the answer. The AI has to guess it. The prompt-review-correct loop adds overhead with minimal upside.
Apply METR’s own logic to the inverse scenario and the picture changes completely. Junior developers on unfamiliar codebases, the population measured in the Cui et al. multi-company RCT, showed 21-40% gains. Those are randomised controlled trial results, not self-reported surveys. The onboarding data from DX Research shows time-to-10th-PR cut in half. These are the cases where AI adds information the developer does not already have, and the gains are large and consistent.
METR does not show that AI slows down developers. It shows that AI slows down developers who do not need it for the specific task at hand. That distinction is everything.
GitClear: a discipline problem, not an AI problem
GitClear’s data on rising code churn (3.1% to 5.7%) and declining refactoring (24.1% to 9.5%) is real and worth taking seriously. But attributing those trends to AI as a technology, rather than to how teams are using AI, conflates the tool with the process.
Teams that generate AI code and merge it without proper review are going to churn code. That is not surprising. Teams that treat AI output as a first draft, review it properly, and enforce quality gates are not showing the same patterns. The churn problem is a review process problem. The refactoring decline is a prioritisation problem. Both are solvable without abandoning the tool.
The copy/paste increase (8.3% to 12.3%) is the most actionable finding. It tells you something specific about how coding assistants work today: they are better at generating new code than at finding existing code to reuse. That is a tooling gap, not a fundamental limitation. Codebase-aware retrieval is improving rapidly.
The trajectory, not the snapshot
Reading these three studies as a single indictment misses the signal in the noise. The direction is early instability followed by improving fundamentals. DORA’s throughput recovery is real. Code quality and documentation quality gains are real. The populations where AI has the clearest advantage, junior developers and developers new to a codebase, are growing in both sample size and effect magnitude across successive studies.
The teams that will benefit most are the ones investing in the right adoption patterns now, not the ones waiting for the numbers to look perfect before they start.