The aggregate productivity numbers from research on AI coding tools are modest and plateauing. But aggregate numbers hide enormous variance. Some developers are saving 4+ hours a week. Some are going slower. Some teams have halved their onboarding time. Others are drowning in review queues.
The question is not “does AI improve productivity?” The honest answer to that is “sometimes, for some people, on some tasks.” The useful question is: where does AI reliably deliver, and how do you set up the conditions for it?
The clearest win: onboarding
Of all the productivity metrics in the DX Research dataset, the onboarding result stands out for its size and its clarity.
Between Q1 2024 and Q4 2025, the time for a new developer to reach their 10th pull request, a standard industry proxy for “productive contributor,” was cut in half. Fifty percent reduction. That is not the kind of gain you see in self-reported time-savings surveys. That is a measurable, objective outcome tracked against a consistent benchmark.
The gains compound. A developer who ramps up faster starts contributing to production earlier. The DX research found the productivity boost from faster onboarding persists for two or more years. When you run the maths, the ROI from AI-assisted onboarding is among the strongest in the entire AI productivity literature, and it scales directly with how often you hire.
If you are deploying AI tools to exactly one population, make it new developers and developers new to a codebase.
Junior developers vs senior engineers
The seniority question is where most organisations get their assumptions backwards.
Two independent randomised controlled trials now point in the same direction. A multi-company RCT by Cui et al. (covering 4,867 developers across Microsoft, Accenture, and a Fortune 100 company) found 21-40% productivity gains for junior and mid-level developers. The METR RCT found experienced developers were 19% slower with AI tools, on tasks in codebases they knew well.
The intuition behind most enterprise AI rollouts is that senior engineers should be the priority, since they are more expensive per hour and time savings are worth more. The data says the opposite: junior developers show the largest and most consistent gains, and senior engineers who already know the answer derive less benefit from a tool that needs to guess it.
This does not mean senior engineers should not use AI. Staff+ engineers who do adopt daily save around 4.4 hours per week, the highest absolute time saving of any group. The point is that adoption should not be forced at the senior level. Remove barriers, measure outcomes, and let senior developers self-select into the use cases where they find it genuinely useful. For many, that turns out to be architectural exploration, code review assistance, and complex query writing, not inline completion.
The task-level ROI breakdown
The returns from AI are not uniform across task types. Based on the combined evidence:
The clearest returns come from tasks where the developer is navigating unfamiliar territory: onboarding to a new codebase, boilerplate and scaffolding, test generation, documentation. AI’s pattern-matching is strongest when the developer does not have strong existing intuitions to override.
Refactoring, stack trace analysis, code review assistance, and migrations produce useful but less consistent gains. The overhead of reviewing AI output becomes more significant as tasks get more nuanced.
Complex architecture decisions in familiar codebases, deep system design, tasks where the senior engineer already knows the answer: this is where the METR finding lives. The prompt-review-correct loop adds friction rather than removing it.
The common thread: AI’s value is inversely correlated with how well you already know the territory. When you are navigating something new, AI is a remarkable accelerant. When you are operating in deeply familiar ground, the prompt-review-correct loop adds more friction than it removes.
DevEx is the prerequisite
The DX data is unambiguous on one point: the organisations that see AI working well already had strong developer experience fundamentals in place before AI arrived. This is not a coincidence.
Fast CI/CD pipelines matter more than most teams expect. If your test suite takes 45 minutes to run, the prompt-code-verify loop is broken regardless of how good the AI is.
Clear, maintained documentation improves AI performance measurably. Coding assistants reason over your codebase and perform significantly better when it has clear naming conventions and enough context for the model to make accurate inferences. The cost of poor documentation was always there; AI makes it visible faster.
Well-defined service boundaries make AI-assisted changes safer. When services have clear interfaces and responsibilities, an AI-generated change to one service is less likely to break another. In tightly coupled systems, AI-generated code creates exactly the instability DORA documented.
None of these are new ideas. What AI has done is sharpen the cost of not having them.
How to measure whether it is working
The DX AI Measurement Framework separates AI impact into three dimensions, and tracking all three matters.
Utilisation is the easiest to track: how widely are tools adopted, what is the daily versus weekly split, which teams are engaged. Necessary context, but not sufficient evidence of value.
Impact is what most organisations undertrack: actual throughput, incident rates, onboarding time, developer satisfaction scores. Not self-reported time savings. These are the numbers that determine whether the investment is paying off.
Cost is the dimension that often gets skipped: total programme cost (licences, infrastructure, training time, oversight overhead) relative to measurable gains. Which use cases have the best returns? Where is money being spent on tools that are not delivering?
Most organisations tracking AI adoption are measuring utilisation and calling it impact. The companies seeing genuine returns are measuring all three.
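What tracking all three dimensions might look like can be sketched in a few lines. This is an illustrative model only: the field names, the net-hours formula, and every number below are my assumptions, not part of the DX AI Measurement Framework itself.

```python
from dataclasses import dataclass

@dataclass
class AIProgrammeMetrics:
    # Utilisation: how widely the tools are actually used
    daily_active_users: int
    licensed_seats: int
    # Impact: objective outcomes, not self-reported savings
    hours_saved_per_week: float         # measured, per active user
    extra_review_hours_per_week: float  # overhead imposed on the rest of the team
    # Cost: total programme cost per week
    licence_cost_per_week: float
    training_and_oversight_per_week: float

    def utilisation_rate(self) -> float:
        return self.daily_active_users / self.licensed_seats

    def net_hours_per_week(self) -> float:
        # Gross saving minus the review overhead the tool creates elsewhere
        gross = self.hours_saved_per_week * self.daily_active_users
        return gross - self.extra_review_hours_per_week

    def roi(self, loaded_hourly_rate: float) -> float:
        value = self.net_hours_per_week() * loaded_hourly_rate
        cost = self.licence_cost_per_week + self.training_and_oversight_per_week
        return value / cost

m = AIProgrammeMetrics(
    daily_active_users=120, licensed_seats=400,
    hours_saved_per_week=2.0, extra_review_hours_per_week=90.0,
    licence_cost_per_week=4000.0, training_and_oversight_per_week=2000.0,
)
print(f"utilisation {m.utilisation_rate():.0%}, "
      f"net hours {m.net_hours_per_week():.0f}, "
      f"ROI {m.roi(loaded_hourly_rate=100):.1f}x")
# utilisation 30%, net hours 150, ROI 2.5x
```

The point of the sketch is the shape, not the numbers: a programme that reports only `utilisation_rate` has no way to notice that `extra_review_hours_per_week` is eating the gross saving.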
Sources: Multi-company RCT, Cui et al., “Effects of Generative AI on High-Skilled Work” (4,867 developers, IT Revolution 2024). METR RCT (2025). DX Research Q4 2025, Feb 2026. Faros AI (2025). DX AI Measurement Framework.
The ROI from AI coding tools is real. It is also narrow, fragile, and dependent on conditions that most organisations do not meet. The gap between “AI works in these specific scenarios” and “AI is delivering value in our organisation” is where most of the investment is being wasted.
The onboarding win has prerequisites most teams lack
Halving the time to a new developer's tenth PR is a genuine finding in the DX data. I am not disputing the number. But the organisations in that dataset are not typical. They have fast CI pipelines. They have maintained documentation. They have well-defined service boundaries and clear onboarding paths.
Most engineering organisations do not have those things. If your CI takes 45 minutes, your documentation is six months stale, and your service boundaries are a polite fiction, AI-assisted onboarding is not going to halve your ramp time. It is going to help new developers produce code faster in an environment that cannot absorb it, which is exactly the instability pattern DORA documented.
The onboarding ROI is real for organisations that already invest in developer experience. For everyone else, it is a number from someone else’s context being used to justify your budget.
Junior developer gains come with a hidden cost
The Cui et al. RCT found 21-40% productivity gains for junior and mid-level developers. That is a solid finding from a well-designed study. But productivity measured over a trial period is not the same thing as career development measured over years.
When a junior developer uses AI to scaffold code they would otherwise have written by hand, they produce the deliverable faster. They also skip the learning that comes from struggling with the problem. Code comprehension, architectural reasoning, the ability to debug systems you did not build: these skills develop through friction, not through generating correct code on the first try.
I am not making a theoretical argument here. The GitClear data shows refactoring declining 60% across the industry as AI adoption rises. Refactoring is how developers learn to think structurally about code. If junior engineers stop practising it because AI makes it easier to generate new code than to restructure existing code, the productivity gain today becomes a skill deficit in three years.
No study in this literature has measured long-term skill development in AI-assisted juniors versus traditionally trained juniors. Until that data exists, treating the 21-40% gain as pure upside is premature.
The measurement problem is endemic
The DX AI Measurement Framework separates utilisation, impact, and cost for a reason: because most organisations measure only the first and present it as proof of value.
Here is what measurement looks like at most companies I have talked to. They track adoption rates. They track seat utilisation. They survey developers on perceived productivity. Then they present those numbers as evidence that the AI programme is delivering value.
None of those metrics measure value. Utilisation tells you people are using the tool. Surveys tell you people believe the tool helps. But METR showed a 40-point gap between perceived and actual productivity for experienced developers. If your measurement framework cannot detect a gap that large, it cannot tell you whether your investment is working.
Measuring impact, the middle dimension of the DX framework, requires tracking objective outcomes: throughput, incident rates, onboarding time, code quality metrics. Most organisations do not have the instrumentation to measure these things, let alone attribute changes to AI adoption specifically.
Cost measurement is even rarer. Total programme cost includes licences, infrastructure, training time, the overhead of reviewing AI-generated code, and the downstream cost of AI-introduced defects. Almost nobody tracks the last two, and they may be the largest line items.
The Staff+ savings are real but misunderstood
Staff+ engineers saving 4.4 hours per week is the finding that gets cited most in executive presentations. It is also the finding that is most consistently misapplied.
Those 4.4 hours come from Staff+ engineers who have self-selected into daily AI use. They have found specific use cases (architectural exploration, complex queries, review assistance) where AI genuinely helps. That is not the same as “AI saves every Staff+ engineer 4.4 hours.” The engineers who tried AI and stopped using it are not in that number.
More importantly, the 4.4 hours saved is a gross number, not a net number. It does not account for the time other engineers spend reviewing AI-generated code, fixing AI-introduced bugs, or dealing with the increased PR volume that AI enables. If one Staff+ engineer saves four hours by generating more PRs, and two other engineers each spend an extra hour reviewing those PRs, the net organisational saving is two hours, not four. No study in this literature measures net team-level impact.
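The gross-versus-net arithmetic in that example is worth making explicit. A minimal sketch, using the hypothetical numbers from the paragraph above (not measured data from any study):

```python
def net_team_saving(gross_hours_saved: float,
                    reviewers: int,
                    extra_review_hours_each: float) -> float:
    """Net organisational saving: the individual's gross saving minus
    the review load their extra output imposes on colleagues."""
    return gross_hours_saved - reviewers * extra_review_hours_each

# One Staff+ engineer saves 4 hours by generating more PRs;
# two other engineers each spend an extra hour reviewing them.
print(net_team_saving(4.0, reviewers=2, extra_review_hours_each=1.0))  # 2.0
```

The function is trivial, which is the point: the correction is easy to apply, but only if the review overhead is actually being measured.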
Where does this leave you?
AI delivers ROI in specific, well-understood use cases: onboarding, junior developer acceleration, certain task types where the developer is navigating unfamiliar territory. Those returns are real and I would not argue against investing in them.
But most organisations are not investing in “specific, well-understood use cases.” They are buying enterprise licences, mandating adoption, and measuring utilisation. The gap between the research and the reality of most AI programmes is vast, and the measurement practices at most companies are not sophisticated enough to tell the difference between “AI is delivering value” and “people are using the tool we bought.”
Before expanding your AI investment, answer three questions honestly. First, can you measure objective productivity outcomes, not just adoption? Second, have you accounted for the full cost, including review overhead and quality impact? Third, are the DevEx prerequisites actually in place?
If the answer to any of those is no, you are investing on faith, not data.
The onboarding finding alone justifies broad AI deployment. But it is not alone. The data across multiple independent studies points to a clear set of use cases where AI reliably delivers, and the returns are large enough that debating aggregate productivity numbers is a distraction.
Onboarding: the compounding win
Halving the time to a new hire's tenth PR is not a marginal improvement. It is a measurable, compounding return that the DX data tracks consistently from Q1 2024 through Q4 2025. Every new hire reaches productive contribution faster. That effect persists for two or more years, according to the DX longitudinal data.
Run the maths on a team that hires 20 developers a year. If each one starts delivering meaningful work a month earlier, that is 20 developer-months of additional productive output annually. At fully loaded cost, that is a return that pays for the entire AI tooling programme multiple times over.
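The back-of-envelope maths above can be written out. Every figure here is an illustrative assumption (hire count from the paragraph; the cost and tooling-spend numbers are hypothetical, not from the DX data):

```python
# Back-of-envelope onboarding ROI. All inputs are illustrative assumptions.
hires_per_year = 20
months_saved_per_hire = 1           # each hire productive one month earlier
loaded_cost_per_dev_month = 15_000  # hypothetical fully loaded cost, USD
tooling_cost_per_year = 60_000      # hypothetical AI programme cost, USD

recovered_value = hires_per_year * months_saved_per_hire * loaded_cost_per_dev_month
print(f"recovered: ${recovered_value:,}")  # recovered: $300,000
print(f"pays for tooling {recovered_value / tooling_cost_per_year:.0f}x over")
```

Note how directly the result scales with `hires_per_year`: the same tooling spend at a team hiring two developers a year recovers a tenth of the value, which is why the onboarding ROI argument is strongest for organisations that hire continuously.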
The companies that have seen this most clearly are the ones with good DevEx fundamentals already in place: fast CI, clean documentation, well-scoped services. AI did not create the onboarding advantage from nothing. It amplified existing strengths. But the amplification is dramatic enough that organisations without those fundamentals should be investing in them specifically to unlock this return.
Junior developers: the strongest signal in the data
The Cui et al. multi-company randomised controlled trial covered 4,867 developers across Microsoft, Accenture, and a Fortune 100 company. Junior and mid-level developers showed 21-40% productivity gains. These are RCT results, not survey data. Not vibes. Controlled measurements from real work at real companies.
That 21-40% range is the clearest productivity signal in the AI literature. Few, if any, interventions in software engineering have produced gains of that magnitude for early-career developers in a controlled setting; even IDE adoption in the early 2000s, perhaps the closest comparison, showed smaller measured effects.
I keep coming back to what this means in practice. A junior developer with AI assistance is closing the gap to mid-level performance faster. They are writing more complete first drafts. They are finding relevant code patterns in unfamiliar codebases more quickly. The scaffolding that would take a new developer hours of searching and reading happens in minutes. The skill-building still occurs, because they still have to understand and modify what the AI produces, but the lookup time collapses.
Staff+ engineers: high-leverage savings
Staff+ engineers who adopt AI daily save around 4.4 hours per week. That is the highest absolute time saving of any seniority level in the DX data. And these hours are not saved on boilerplate. They are saved on the highest-leverage activities: architectural exploration, complex query construction, code review.
When a Staff+ engineer saves an hour on a code review, that hour is worth more than an hour saved by a junior developer on scaffolding. The senior engineer’s time is the bottleneck. Every hour freed up from mechanical work is an hour available for system design, mentorship, and the decisions that compound across the organisation.
The adoption pattern matters here. The data is clear that forcing adoption at the senior level does not work and is counterproductive. Staff+ engineers who self-select into AI use are the ones seeing the 4.4-hour savings. The ones who are forced into it show lower satisfaction and no measurable productivity gain. Let senior engineers choose their use cases. They will find the right ones.
Stop debating the aggregate number
The aggregate productivity question, “does AI make developers more productive?”, is the wrong question. It averages together populations whose outcomes are nearly 50 points apart. Junior developers gaining 30% and senior developers losing 19% average out to a modest positive that tells you nothing useful about where to invest.
The right questions are specific. Does AI improve onboarding time? Yes, by 50%. Does AI improve junior developer output? Yes, by 21-40% in controlled trials. Does AI save time for Staff+ engineers who choose to use it? Yes, 4.4 hours per week.
These are not “some use cases work” findings. These are reliably reproducible returns backed by independent, controlled research. Deploy into these use cases aggressively. Measure impact properly using the DX three-dimensional framework (utilisation, impact, cost). Stop waiting for the aggregate number to settle. It never will, because the aggregate number was never the right metric.