Work through the data from the previous posts and a pattern emerges. AI adoption is near-universal. Productivity gains have plateaued around 10%. AI-written code increases delivery instability. The tools work better for junior developers than senior ones. And the biggest single win, faster onboarding, is real but bounded.
These are actual gains. They’re just not the gains that will determine which companies win the next decade.
The real opportunity from AI isn’t making your existing product 10% faster to build. It’s cutting the cost of learning what to build in the first place.
The J-curve hypothesis
Brynjolfsson’s argument is that we are in the investment phase: organisations have adopted the technology, productivity looks flat at the developer level, and the harvest phase with accelerating returns is still ahead. The macro data from 2025 is at least consistent with this. US productivity growth hit 2.7%, nearly double the previous decade’s average. Q4 GDP growth came in at 3.7%, with output holding up despite slower job growth.
But the developer-level data from DX, DORA, and METR doesn’t show a harvest phase arriving yet. The most likely explanation is that the macro gains are coming from non-coding applications (customer service, content production, operations) rather than software development specifically. The 2026 numbers will be more decisive.
Either way, the J-curve raises the right question: if the gains so far are modest, what should you actually be optimising for?

Two ways to point AI at your organisation
Most organisations deploying AI are pointing it at the efficiency problem: how do we build the things we have already decided to build, faster and cheaper?
This is the obvious application of a tool that accelerates code production. It produces the 10% gains the data shows. It also carries the DORA stability risks. And it hits a ceiling, because the constraint on building what you have already decided to build is rarely code generation speed.
The alternative is pointing AI at the discovery problem: how do we find out, faster and cheaper, whether the thing we are thinking about building is actually the right thing?
That changes the economics of experimentation. Historically, running a software experiment has been expensive. You need to spec, design, build, deploy, and measure something before you know if it was worth building. That cost imposes a filter. Only high-confidence bets clear the bar. Low-confidence, speculative ideas don’t get built because the downside of being wrong is too high.
AI collapses that cost. A prototype that would have taken six weeks to build now takes a week. A hypothesis that would have required a full sprint to test can be running in a day. More bets clear the bar. You run more experiments. You find out what works faster.
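To make the filter effect concrete, here’s a toy expected-value model. Treat each idea as a bet with an estimated success probability and payoff, and build only when the expected value clears the cost of finding out. Every number in it is an illustrative assumption, not a figure from the research:

```python
# Toy expected-value model of the experiment filter. All numbers are
# illustrative assumptions, not figures from the research.

ideas = [
    # (name, probability of success, value if it works)
    ("high-confidence bet", 0.70, 100),
    ("moderate bet",        0.30, 100),
    ("speculative bet",     0.15, 100),
]

def worth_testing(p, value, build_cost):
    """An idea clears the bar when expected value exceeds build cost."""
    return p * value > build_cost

# Before: six dev-weeks at an assumed 10 cost units per week. After:
# AI cuts the build to one week, and the bar drops with it.
for label, cost in [("6-week prototype", 60), ("1-week prototype", 10)]:
    cleared = [name for name, p, value in ideas if worth_testing(p, value, cost)]
    print(f"{label}: {cleared}")

# 6-week prototype: ['high-confidence bet']
# 1-week prototype: ['high-confidence bet', 'moderate bet', 'speculative bet']
```

Nothing about the ideas changed; only the cost of testing them did, and the speculative bets went from unjustifiable to cheap enough to try.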

What this means in practice
The DORA data is clear that AI increases instability in mature production systems. That instability is genuinely harmful in a system you are maintaining. It is irrelevant in a system you are testing.
Throwaway prototypes are supposed to be unstable. MVPs are supposed to be rough. The whole point of a disposable experiment is that when it fails (and most experiments fail) you learn something cheaply and move on. DORA’s instability finding is not a problem when the code is disposable by design.
So the allocation question becomes concrete. In your production codebase, apply what the research says: careful review, spec-driven approaches, measurement. AI in mature codebases requires more oversight, not less. On the edges (new product lines, adjacent ventures, speculative features), the instability penalty doesn’t apply. That’s where you let AI run.
The companies that win in an AI-saturated market won’t be the ones who shipped their roadmap 10% faster. They’ll be the ones who figured out their next product 10x cheaper.

The closing argument
Every data point in this series tells the same story. AI’s gains in existing codebases are real but modest; they plateau fast and come with stability costs. That is the honest summary of what the research shows as of February 2026.
But the same research that shows a 10% productivity ceiling on existing codebases also shows that AI can cut onboarding time in half, a gain that compounds for years. It shows that experienced developers who use AI for exploration rather than execution see the highest time savings. It shows that the organisations that saw the best outcomes used AI at the system level, not just at the individual task level.
The ceiling is on optimisation. The floor on discovery hasn’t been found yet.
Don’t use AI to build faster. Use it to learn faster.
The J-curve argument is doing a lot of heavy lifting here. Let’s look at what it actually rests on.
Brynjolfsson’s framework says we’re in the investment phase and the harvest is coming. The 2025 macro data (2.7% US productivity growth, Q4 GDP at 3.7%) is offered as early evidence. But when you look at where those gains came from, the story changes. Customer service automation. Content production. Back-office operations. Not software development.
The developer-level data from DX, DORA, and METR tells a different story entirely: a productivity plateau, not a harvest. If the J-curve’s harvest phase were arriving for software engineering, you’d expect to see it in the METR trial. You don’t. Experienced developers went 19% slower.
The “discovery” reframe lowers the bar
The argument that AI should be used for experimentation rather than execution is clever. It takes AI’s modest production-code gains and reframes them as a non-issue: we weren’t trying to optimise production code anyway. We’re running experiments.
But this reframe hides a measurement problem. If we stop measuring “did the team ship faster and more reliably” and start measuring “how many bets did the team take,” almost any investment looks good. Running more experiments is only valuable if the experiments produce valid signal. And that’s where DORA’s instability finding becomes relevant again.
Cheap prototypes give you flawed signal
The argument that DORA’s instability findings don’t matter for throwaway prototypes sounds reasonable until you think about what a prototype is for. It’s for learning. You’re testing whether users want something, whether a workflow makes sense, whether an approach is viable.
Flaky prototypes give you flaky signal. If your AI-generated MVP crashes intermittently, has inconsistent behaviour across sessions, or subtly misimplements the feature you’re testing, the data you collect is contaminated. You’re not learning whether users want the feature. You’re learning whether users can tolerate the bugs long enough to engage with it.
That’s not a clean experiment. It’s a noisy one. And noisy experiments require more samples to reach statistical significance, which erodes the cost advantage that was supposed to make the whole approach work.
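The sample-size arithmetic behind that claim is standard. For a two-sample comparison of means, the required sample per arm grows with the square of the noise. A sketch using the textbook formula, with made-up effect and noise figures:

```python
# Required sample size per arm for a two-sample test of means:
#   n ≈ 2 * (z_alpha + z_beta)^2 * sigma^2 / delta^2
# n grows with sigma squared, so noise added by a flaky prototype
# inflates the required sample quadratically. Numbers are illustrative.
from statistics import NormalDist

def n_per_arm(delta, sigma, alpha=0.05, power=0.80):
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = z.inv_cdf(power)
    return 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2

clean = n_per_arm(delta=0.5, sigma=1.0)  # stable prototype
noisy = n_per_arm(delta=0.5, sigma=1.5)  # bugs add measurement noise
print(f"clean signal: ~{clean:.0f} users per arm")
print(f"noisy signal: ~{noisy:.0f} users per arm ({noisy / clean:.2f}x)")
# 1.5x the noise means 2.25x the sample size: the cheap prototype
# needs more than twice the traffic to reach the same conclusion.
```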
The economics don’t add up the way they’re presented
The core claim is: AI collapses the cost of experimentation, so you can run more bets, so you learn faster. Each step in that chain has a hidden assumption.
AI collapses the cost of building the experiment. It doesn’t collapse the cost of designing it, deploying it, instrumenting it, collecting data, or analysing results. In a well-run product experiment, implementation is maybe 30% of the total cost. Cutting that 30% in half gives you a 15% reduction in experiment cost. Real, but not the 10x that the framing implies.
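That 30%-to-15% step is Amdahl’s law applied to experiment cost, and it’s worth seeing the ceiling it implies. A quick check, using the same assumed 30% implementation share and 2x speedup:

```python
# Amdahl's law applied to experiment cost: speeding up only the
# implementation slice caps the overall saving at that slice's share.
def remaining_cost(accelerated_share, speedup):
    """Fraction of the original experiment cost left after the speedup."""
    return (1 - accelerated_share) + accelerated_share / speedup

# Assumed: implementation is ~30% of total cost and AI halves it.
print(f"with a 2x build speedup: {remaining_cost(0.30, 2):.0%} of cost remains")
# -> 85% of cost remains, i.e. a 15% reduction

# Even an infinite build speedup leaves every other cost untouched.
print(f"with an infinite speedup: {remaining_cost(0.30, float('inf')):.0%}")
# -> 70% of cost remains, a hard 30% ceiling
```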
Running more bets only helps if you have the capacity to evaluate them. Most product teams are bottlenecked on analysis, not on building. They already have more ideas than they can measure. Generating prototypes faster just moves the bottleneck downstream.
The macro data is misleading
The 2025 productivity numbers are real. But attributing them to AI, specifically to AI in software development, requires a chain of inference that the data doesn’t support. Productivity growth in the US has been volatile for decades. A single strong year after a technology adoption wave is consistent with AI being the cause. It’s also consistent with post-pandemic normalisation, immigration-driven labour supply changes, and half a dozen other factors.
The honest reading of the data: AI is probably contributing to productivity in specific, measurable ways (customer service, content, operations). In software development specifically, the evidence for net productivity gains remains weak. The J-curve might be real. It might also be the tech industry’s favourite coping mechanism for investments that haven’t paid off yet.
I’d want to see the 2026 developer-level numbers before calling this one.
The J-curve is not a metaphor. It’s a prediction with a timeline, and the harvest phase is closer than the productivity data suggests.
Look at what has actually changed. A product hypothesis that would have required a full sprint to test last year can now be running in a day. Not because AI writes code marginally faster, but because the economics of running bets have fundamentally shifted. The filter on which ideas get tried at all is loosening. That’s not a 10% improvement. That’s a different game.
The productivity ceiling is the wrong metric
Everyone fixates on the DX and METR numbers showing a 10% ceiling. Fair enough. But that ceiling measures one thing: how fast teams ship tickets they’ve already written. It says nothing about the value of what gets shipped.
The real leverage is in the number of experiments a team can run per quarter. Before AI, most product organisations tested maybe two or three big bets per cycle because each one cost a full engineering sprint. Now imagine running ten. Fifteen. The win rate per experiment doesn’t need to change. The volume does all the work.
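The arithmetic behind the volume claim is simple: with a fixed per-experiment win rate, the odds of landing at least one winner per quarter climb steeply with the number of bets. A sketch, with the win rate as an assumed figure:

```python
# With a fixed per-experiment win rate p, the chance of at least one
# winner in n bets is 1 - (1 - p)^n. The win rate is an assumed
# figure for illustration, not a measured one.
p = 0.15  # assumed probability that any single bet succeeds

for n in (3, 10, 15):
    at_least_one_win = 1 - (1 - p) ** n
    print(f"{n:>2} bets per quarter: {at_least_one_win:.0%} chance of a winner")

#  3 bets per quarter: 39% chance of a winner
# 10 bets per quarter: 80% chance of a winner
# 15 bets per quarter: 91% chance of a winner
```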
Stripe’s approach to product development already looked like this before AI: small bets, fast kills, double down on signal. AI makes that playbook available to companies that couldn’t previously afford it.
DORA instability doesn’t apply to throwaway code
The DORA finding that AI increases delivery instability is real. It is also irrelevant to the use case that matters most.
Throwaway prototypes are unstable by design. An MVP testing a hypothesis doesn’t need 99.9% uptime. It needs to run long enough to generate signal. If the experiment works, you rebuild it properly. If it doesn’t, you delete it. DORA’s metrics are designed for production systems with SLAs. Applying them to disposable experiments is a category error.
The companies I keep coming back to in this analysis are the ones that bifurcate their approach. Production code gets careful review, spec-driven methodology, all the discipline the research recommends. Experimental code gets speed. AI is pointed at discovery, not at the backlog.
The 2025 macro data is a leading indicator
US productivity growth hit 2.7% in 2025. Q4 GDP growth came in at 3.7% despite slower job growth. Yes, a lot of that is customer service and operations. But that’s how technology adoption works. It starts at the edges and moves inward. The developer-specific gains are still in the investment phase precisely because software development is harder to transform than content production.
Brynjolfsson’s framework predicts that developer productivity will follow, lagged by one to two years, as tooling matures and workflows adapt. The early data is consistent with that prediction.
This is the floor, not the ceiling
The current state of AI-assisted development is the worst it will ever be. The tools are immature. The workflows haven’t adapted. The measurement frameworks are still catching up. And even now, with all of that friction, teams are already reporting that experimentation velocity has meaningfully increased.
The companies that figure out how to run 10x more product bets cheaply will outcompete those that only point AI at their backlog. That gap will compound. Each experiment generates learning, each lesson informs the next bet, and the cycle accelerates.
Ten percent faster on existing work is table stakes. The real question is who learns fastest. That race is just starting.
Sources: Erik Brynjolfsson, Stanford Digital Economy Lab (Feb 2026). Bureau of Labor Statistics. DX Research (Feb 2026). Google DORA 2024/2025. METR RCT (2025).