
Your AI Code Factory Is a Funnel. Start Treating It Like One.

Most engineering teams adopt AI coding patterns by opinion. The alternative: instrument your code generation pipeline before you commit to any framework, model, or toolchain.


Viewpoint

By David Brainard, CTO of EverQuote.

The blog posts and vendor recommendations about AI coding patterns have one thing in common: they don’t know what works for your codebase.

A pattern that raises one-shot success by 20 points in one organisation might do nothing in yours. A model that benchmarks beautifully might fall apart on your specific mix of complexity and architectural constraints. Without measuring what’s actually happening in your pipeline, you can’t know which it is, and you’ll keep settling questions according to whoever argues most confidently in a meeting.

The factory without a dashboard is a black box

Most teams adopting agentic coding today are doing so by prescription. They read a framework recommendation, watch a demo, copy a prompting pattern. When it doesn’t work as expected, they adjust by intuition: more context, different model, tighter instructions. Sometimes that helps. Sometimes it doesn’t. The root problem is the same either way: there’s no measurement.

This isn’t a criticism of the teams. The tooling and culture for measuring AI coding pipelines don’t yet exist the way they do for, say, deployment frequency or test coverage. But without instrumentation, every decision about how to build your factory is an opinion, not a fact. You’re optimising a process you cannot see.

The fix is to treat the factory as a funnel.

The CRO analogy is exact

Every AI coding run is a conversion event. Context goes in: business requirements, architectural specs, story-level detail. Shipped code comes out. Between those two endpoints, there are measurable stages where the run can succeed, fail, or require human intervention.

This is conversion-rate optimisation (CRO) applied to code generation. The funnel has seven stages: Change Identity (what is this run, under what conditions?), Context Inputs (did the AI have enough to work with?), AI Execution (what did it produce, across how many attempts?), Verification (did it pass automated checks?), Review and Human Intervention (did a human need to fix it?), Deployment and Outcomes (did it ship and deliver value?), and Derived Metrics (how is the factory doing overall?).

At each stage, runs drop out. Instrument the stages and you can see exactly where your factory leaks, and what fixing it is worth.
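That leak can be read straight off the telemetry. A minimal sketch, assuming each run record carries the furthest stage it reached (the field and stage names here are hypothetical, not a schema from this article):

```python
from collections import Counter

# Hypothetical names for the funnel's first six stages; stage G
# (derived metrics) is computed downstream, so runs never "reach" it.
STAGES = ["change_identity", "context_inputs", "ai_execution",
          "verification", "review", "deployment"]

def funnel_survival(runs):
    """For each stage, count how many runs got at least that far."""
    dropped_at = Counter(r["furthest_stage"] for r in runs)
    surviving, report = len(runs), {}
    for stage in STAGES:
        report[stage] = surviving
        surviving -= dropped_at[stage]  # these runs stopped here
    return report

# Toy data: four runs, two shipped, one stalled in verification,
# one stalled in review.
runs = [{"furthest_stage": s} for s in
        ["deployment", "deployment", "verification", "review"]]
report = funnel_survival(runs)
```

Wherever the survival curve bends hardest is where the factory leaks, and the size of the bend is what fixing it is worth.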

AI code generation funnel: 7 stages from Change Identity to Derived Metrics

Without the funnel, a falling success rate is a mystery. With it, you know whether the problem is in context assembly, self-correction loops, test feedback quality, or review trust. Those require different interventions.

The four metrics that matter

The derived metrics at stage G give you a complete picture of factory health. Four numbers, each catching something the others miss.

One-shot success rate is the north star: did the AI’s first attempt pass CI and merge without human modification? It’s the purest measure of factory quality (no loops, no human fixes). It’s also gameable in isolation, which is why you need the others.

Eventual autonomous success measures whether AI output eventually merged without human code changes, across any number of self-correction loops. The gap between this and one-shot success tells you how much work your scaffolding is doing. A 35%/37% split means the loops are nearly useless. That’s a feedback problem, not a model problem.

Cycle time (PR open to merge) measures human friction. One-shot success rising, cycle time also rising: engineers are applying exhaustive review to AI PRs even when the code is good. The bottleneck has moved downstream.

MLTC, mean lead time for changes (commit to deploy), is the honesty check. A rising one-shot rate with flat MLTC means something downstream (review, deployment, process overhead) is absorbing the gains.

The four factory health metrics and how they relate

One-shot success tells you what the AI is doing. Cycle time and MLTC tell you what the organisation is doing with it.
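As a sketch, all four numbers fall out of the same per-run records. The field names below are illustrative assumptions, not a standard schema; map them onto whatever your telemetry actually captures:

```python
from datetime import datetime, timedelta
from statistics import mean

def factory_health(runs):
    """Compute the four factory health metrics from per-run records."""
    merged = [r for r in runs if r["merged"]]
    return {
        # first attempt passed CI and merged with no human modification
        "one_shot_success": mean(
            r["merged"] and r["attempts"] == 1 and not r["human_modified"]
            for r in runs),
        # merged without human code changes, across any number of loops
        "eventual_autonomous": mean(
            r["merged"] and not r["human_modified"] for r in runs),
        # PR open -> merge, averaged over merged runs
        "cycle_time": sum((r["merged_at"] - r["pr_opened_at"]
                           for r in merged), timedelta(0)) / len(merged),
        # commit -> deploy (MLTC), averaged over merged runs
        "mltc": sum((r["deployed_at"] - r["committed_at"]
                     for r in merged), timedelta(0)) / len(merged),
    }

t0 = datetime(2025, 1, 6, 9, 0)
runs = [
    {"merged": True, "attempts": 1, "human_modified": False,
     "pr_opened_at": t0, "merged_at": t0 + timedelta(hours=2),
     "committed_at": t0, "deployed_at": t0 + timedelta(hours=6)},
    {"merged": True, "attempts": 3, "human_modified": True,
     "pr_opened_at": t0, "merged_at": t0 + timedelta(hours=10),
     "committed_at": t0, "deployed_at": t0 + timedelta(hours=30)},
]
health = factory_health(runs)
```

The gap between the first two numbers is the scaffolding's contribution; the gap between the last two is the organisation's.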

What the data actually tells you

Four examples, each surfacing something invisible without instrumentation.

The context experiment. One-shot success on a particular repository is 30%, well below average. Failure classification shows the misses are overwhelmingly “requirements gap”: code that compiles and passes basic tests but doesn’t match what was actually needed. Functionally wrong, not syntactically wrong. The hypothesis: the AI is missing initiative-level context explaining why the change matters.

We split incoming runs for two weeks: half got initiative-level specs added to the prompt, half ran with story context only, both tagged for comparison. The result was clean.

Context experiment: one-shot success 31% (control) vs 52% (enriched with initiative context)

One-shot success on the enriched group hits 52%. The control stays at 31%. Initiative context is load-bearing for this repository, and it becomes the default. That’s a question that would normally be settled by opinion. Instead, it’s a number.
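Whether a 31%-vs-52% split is signal or noise depends on how many runs are in each arm. A minimal significance check, using a standard two-proportion z-test; the arm size of 200 below is a hypothetical illustration, not a figure from the experiment:

```python
from math import sqrt, erf

def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))); two-sided p-value
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# 31% of 200 control runs vs 52% of 200 enriched runs
z, p = two_proportion_z_test(62, 200, 104, 200)
```

At those arm sizes the z statistic lands well above the ~1.96 threshold for 95% confidence, which is why a pipeline moving thousands of PRs a week can settle this kind of question inside two weeks.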

The scaffolding value discovery. One team: 35% one-shot success, 37% eventual autonomous success. A nearly nonexistent gap. Pulling the attempt chain data and diffing attempt-to-attempt reveals the problem: the AI makes the same mistake on every retry. It’s not learning from CI failures because the test suite produces generic error messages: “Test failed” with a stack trace and no description of what the test was checking. The AI can’t diagnose what went wrong, so each retry is a random swing.

After updating the test output formatting, eventual autonomous success on that team rises to 64%. The intervention isn’t a model change or a context change. It’s making the test suite a better teacher.

The model swap. Running both a frontier model and an open-weight model on the same tasks for a full sprint, stratified by complexity, shows the headline gap (65% vs. 48% one-shot success) is almost entirely concentrated in complex and architectural changes. On standard-complexity work, they’re nearly identical: 63% vs. 59%. That’s a routing rule, not a preference: open-weight for standard complexity, frontier for the rest. And the failure classifications on the open-weight misses point directly at Phase 3 fine-tuning targets.

The hidden bottleneck. One-shot success is climbing week over week, sitting at 55% and trending up. The factory looks fine. But MLTC hasn’t moved. Cycle time has doubled. Engineers are doing exhaustive line-by-line review on AI-generated PRs while human-authored PRs of the same complexity get a quick scan. The factory improved. The organisation didn’t adapt.

Hidden bottleneck: one-shot success rising while MLTC stays flat

The fix is process and trust, not engineering. A tiered review policy (AI PRs that pass all CI checks and match known patterns get an expedited track) drops cycle time by 40% in the first month. You only see this if you’re watching MLTC alongside one-shot success.

The counterargument: instrumentation is overhead

The real objection isn’t “measurement is bad.” It’s “we don’t have time to build this before we need to ship.” That’s fair in organisations where AI coding is touching a handful of PRs per week.

At scale it flips. With thousands of pull requests moving through a system each week, statistically meaningful datasets accumulate in days. A two-week experiment on context enrichment is enough to detect a 20-point lift with confidence. You’re not theorising. You’re measuring.

The telemetry schema you build for Phase 1 is the same data that enables Phase 2 (model comparison and open-weight substitution) and Phase 3 (fine-tuning). Skipping it in Phase 1 means reconstructing it later at higher cost, with less history behind it.

Start with the telemetry, not the tooling

The sequencing is the whole point. Build the telemetry before you pick a framework, commit to a model, or standardise on a prompting pattern. You don’t need to have solved the pipeline first. Just capture what it’s doing: one row per attempt, across the seven stages.

From there, the decisions are driven by data you actually have. Which context inputs move one-shot success for this codebase? How much is scaffolding contributing versus hurting? Where is the complexity threshold where the open-weight model starts to underperform?

Teams that instrument early find the context enrichments that work, catch scaffolding failures before they compound, and see hidden bottlenecks before they quietly erase months of improvement. Teams that don’t are optimising by feel, in a domain where the data is right there if you build the pipes to collect it.

Build the dashboard first. The factory will tell you what it needs.
