
What the Research Actually Shows: DORA, METR, and GitClear

Three independent research programmes have now published rigorous data on AI's impact on software delivery. The results are more nuanced — and more concerning in places — than the vendor claims suggest.


Viewpoint

Vendor case studies are useful data points. They are not evidence. When OpenAI reports that Codex users inside OpenAI merge 60% more PRs per week, you should note the context: 95% of OpenAI developers use Codex internally, which may reflect organisational expectation as much as organic productivity gain.

For evidence, you need independent research. Randomised controlled trials. Longitudinal studies. Analysis of large code repositories with no commercial stake in the outcome. Three such efforts have now published findings, and they are worth reading carefully.

DORA: speed up, stability down

Google’s DORA (DevOps Research and Assessment) programme is the industry standard for measuring software delivery performance. It has studied roughly 5,000 professionals over two years, specifically examining what AI adoption does to the four DORA metrics.

The 2024 findings: for every 25% increase in AI adoption, delivery stability dropped 7.2%. Delivery throughput fell 1.5%. Time spent on useful work decreased 2.6%. Code quality improved 3.4% and documentation quality improved 7.5%, both genuine positives. But the stability finding is the one that matters most for engineering organisations thinking about risk.
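To make the scale of those per-step deltas concrete, here is a small sketch that applies them to a hypothetical baseline. The assumption that successive 25% adoption increases compound multiplicatively is mine, not DORA's; only the per-step percentages come from the report.

```python
# Hypothetical baseline index for each metric (1.0 = today's level).
baseline = {"stability": 1.0, "throughput": 1.0, "useful_work": 1.0}

# DORA 2024 reported deltas per 25% increase in AI adoption.
deltas = {"stability": -0.072, "throughput": -0.015, "useful_work": -0.026}

def project(metric, adoption_steps):
    """Project a metric after N successive 25% adoption increases,
    assuming (my assumption) the reported deltas compound multiplicatively."""
    return baseline[metric] * (1 + deltas[metric]) ** adoption_steps

# Two 25% adoption steps erode stability by roughly 14%:
print(round(project("stability", 2), 3))  # ~0.861
```

The point of the sketch is that a 7.2% per-step hit compounds: two steps is not 14.4% but close to it, and four steps leaves stability at roughly three-quarters of baseline.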

DORA’s 2025 study found the stability problem persisting. Delivery throughput had recovered to positive territory, which is good. Developer sentiment, though, dropped from 72% to 60%. That is a significant slide in how developers feel about their own work experience with these tools.

The mechanism is straightforward. AI enables developers to produce larger changesets faster. Bigger batches carry higher failure risk, something DORA has documented for a decade independent of AI. Teams are also reporting over-reliance on AI during code review: reviews speed up, but defects get missed. The intuitive hypothesis, that teams would “fail fast, fix fast” and net out ahead, does not appear in the data. Instability drives burnout and harms product quality, and DORA’s longitudinal data does not show recovery.

METR: experienced developers, going slower

The METR randomised controlled trial is the study that should get the most attention, because it was designed to control for the variables that make self-reported productivity surveys unreliable.

METR recruited experienced open-source developers, people who know their codebases well, and randomly assigned them to complete tasks with and without AI assistance. The result: developers with AI tools were 19% slower than those without, on tasks in codebases they were already familiar with.

Those same developers believed they were 20% faster.

The gap between perceived and actual productivity matters. Developers are confident that AI is helping them; on familiar tasks in familiar codebases, the evidence says that confidence is misplaced. The overhead of prompting, reviewing, and correcting AI output (what the DX research calls “a new type of interruption”) appears to outweigh the time saved, at least for experienced engineers working in territory they already know.
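The size of that perception gap is worth working through. The baseline task time below is hypothetical; only the 19% measured slowdown and 20% perceived speedup come from the study.

```python
baseline_hours = 10.0                      # hypothetical task time without AI
actual_with_ai = baseline_hours * 1.19     # measured: 19% slower
perceived_with_ai = baseline_hours * 0.80  # believed: 20% faster

# Ratio of actual to perceived time with AI:
gap = actual_with_ai / perceived_with_ai
print(round(gap, 2))  # ~1.49
```

In other words, tasks took nearly one and a half times as long as the developers believed they did, which is why self-reported productivity surveys alone are not a safe basis for tooling decisions.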

This does not mean AI is useless for senior developers. It means the productivity case is much more context-dependent than the marketing claims suggest.

GitClear: the code quality signal

GitClear analysed 211 million changed lines of code across five years (2020-2024), tracking quality metrics as AI adoption rose. The findings are uncomfortable.

Code churn, code that gets revised within two weeks of being committed, rose from 3.1% in 2020 to 5.7% in 2024. That is code that was written, merged, and then immediately changed. Duplicated code blocks grew at 4x the rate of prior years, with copy/paste lines rising from 8.3% to 12.3% of the total. Most striking: the share of refactored (moved or restructured) lines dropped from 24.1% in 2020 to 9.5% in 2024, a roughly 60% decline.
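GitClear's churn definition is easy to state as code. This is an illustrative sketch of the metric as described above (a line counts as churned if it is revised within two weeks of first being committed), not GitClear's actual pipeline; the data structure and sample data are hypothetical.

```python
from datetime import datetime, timedelta

CHURN_WINDOW = timedelta(days=14)

def churn_rate(line_history):
    """Fraction of lines revised again within two weeks of first commit.

    line_history maps a line identifier to the timestamps of every
    commit that touched it (structure is illustrative).
    """
    churned = 0
    for commits in line_history.values():
        commits = sorted(commits)
        if any(later - commits[0] <= CHURN_WINDOW for later in commits[1:]):
            churned += 1
    return churned / len(line_history)

history = {
    "app.py:42": [datetime(2024, 3, 1), datetime(2024, 3, 6)],  # churned
    "app.py:43": [datetime(2024, 3, 1)],                        # stable
    "util.py:7": [datetime(2024, 3, 1), datetime(2024, 4, 2)],  # revised late
}
print(round(churn_rate(history), 3))  # 1 of 3 lines churned -> 0.333
```

The two-week window is what makes churn a proxy for code that should not have shipped in that state: revision a month later is maintenance, revision five days later suggests the original was wrong or incomplete.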

GitClear’s interpretation, which aligns with the DORA data: AI makes it easy to generate code, but the harder work of restructuring and improving existing code, the work that pays down technical debt, is declining. More code is being added. Less is being cleaned up. The result is a codebase that grows faster and degrades faster.

What these three studies have in common

All three point at the same thing: AI accelerates code production but has not yet been shown to improve the quality or stability of software delivery at scale.

That is not an argument against using AI. It is an argument about what the current tools are actually optimising for. Throughput is not the same as productivity. Teams that treat it as such and ship faster without addressing quality and stability costs are building up a bill that the DORA data says will eventually come due.


Sources: Google DORA 2024 Report, DORA 2025 State of AI-Assisted Software Development, METR Randomised Controlled Trial (2025), GitClear AI Copilot Code Quality Research, 211M lines of code (2020-2024).