By David Brainard, CTO of EverQuote.
The blog posts and vendor recommendations about AI coding patterns have one thing in common: they don’t know what works for your codebase.
A pattern that raises one-shot success by 20 points in one organisation might do nothing in yours. A model that benchmarks beautifully might fall apart on your specific mix of complexity and architectural constraints. Without measuring what’s actually happening in your pipeline, you can’t know which it is, and you’ll keep settling questions in favour of whoever argues most confidently in a meeting.
The factory without a dashboard is a black box
Most teams adopting agentic coding today are doing so by prescription. They read a framework recommendation, watch a demo, copy a prompting pattern. When it doesn’t work as expected, they adjust by intuition: more context, different model, tighter instructions. Sometimes that helps. Sometimes it doesn’t. The root problem is the same either way: there’s no measurement.
This isn’t a criticism of the teams. The tooling and culture for measuring AI coding pipelines don’t exist yet the way they do for, say, deployment frequency or test coverage. But without instrumentation, every decision about how to build your factory is an opinion, not a fact. You’re optimising a process you cannot see.
The fix is to treat the factory as a funnel.
The CRO analogy is exact
Every AI coding run is a conversion event. Context goes in: business requirements, architectural specs, story-level detail. Shipped code comes out. Between those two endpoints, there are measurable stages where the run can succeed, fail, or require human intervention.
This is CRO applied to code generation. The funnel has seven stages: (A) Change Identity (what is this run, under what conditions?), (B) Context Inputs (did the AI have enough to work with?), (C) AI Execution (what did it produce, across how many attempts?), (D) Verification (did it pass automated checks?), (E) Review and Human Intervention (did a human need to fix it?), (F) Deployment and Outcomes (did it ship and deliver value?), and (G) Derived Metrics (how is the factory doing overall?).
At each stage, runs drop out. Instrument the stages and you can see exactly where your factory leaks, and what fixing it is worth.
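The stages translate directly into a telemetry schema: one row per attempt. A minimal sketch in Python; the field names and the complexity taxonomy are illustrative, not a standard.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AttemptRow:
    """One row per AI attempt, spanning funnel stages A-F.
    Stage G (derived metrics) is computed from rows, not stored on them."""
    # A. Change Identity
    run_id: str
    repo: str
    complexity: str                      # e.g. "standard", "complex", "architectural"
    model: str
    # B. Context Inputs
    context_sources: list = field(default_factory=list)  # e.g. ["story", "initiative"]
    # C. AI Execution
    attempt_number: int = 1
    # D. Verification
    ci_passed: Optional[bool] = None
    failure_class: Optional[str] = None  # e.g. "requirements_gap"
    # E. Review and Human Intervention
    human_modified: Optional[bool] = None
    # F. Deployment and Outcomes
    merged: Optional[bool] = None
    deployed: Optional[bool] = None
```

Every diagnosis later in this piece is, in effect, a query over rows like this.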

Without the funnel, a falling success rate is a mystery. With it, you know whether the problem is in context assembly, self-correction loops, test feedback quality, or review trust. Those require different interventions.
The four metrics that matter
The derived metrics at stage G give you a complete picture of factory health. Four numbers, each catching something the others miss.
One-shot success rate is the north star: did the AI’s first attempt pass CI and merge without human modification? It’s the purest measure of factory quality (no loops, no human fixes). It’s also gameable in isolation, which is why you need the others.
Eventual autonomous success measures whether AI output eventually merged without human code changes, across any number of self-correction loops. The gap between this and one-shot success tells you how much work your scaffolding is doing. A 35%/37% split means the loops are nearly useless. That’s a feedback problem, not a model problem.
Cycle time (PR open to merge) measures human friction. One-shot success rising, cycle time also rising: engineers are applying exhaustive review to AI PRs even when the code is good. The bottleneck has moved downstream.
MLTC (commit to deploy) is the honesty check. A rising one-shot rate with flat MLTC means something downstream (review, deployment, process overhead) is absorbing the gains.

One-shot success tells you what the AI is doing. Cycle time and MLTC tell you what the organisation is doing with it.
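All four numbers can be derived from the same per-run records. A sketch, assuming each run is a dict with illustrative keys; a real pipeline would compute these from the attempt rows.

```python
from statistics import median

def factory_metrics(runs):
    """Derive the four stage-G metrics from per-run records."""
    n = len(runs)
    return {
        # First attempt passed CI and merged with no human edits
        "one_shot": sum(r["first_attempt_merged_clean"] for r in runs) / n,
        # Merged without human code edits, across any number of loops
        "eventual_autonomous": sum(r["merged_without_human_edits"] for r in runs) / n,
        # PR open to merge: human friction
        "median_cycle_time_h": median(r["pr_open_to_merge_hours"] for r in runs),
        # Commit to deploy: the honesty check
        "median_mltc_h": median(r["commit_to_deploy_hours"] for r in runs),
    }
```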
What the data actually tells you
Four examples, each surfacing something invisible without instrumentation.
The context experiment. One-shot success on a particular repository is 30%, well below average. Failure classification shows the misses are overwhelmingly “requirements gap”: code that compiles and passes basic tests but doesn’t match what was actually needed. Functionally wrong, not syntactically wrong. The hypothesis: the AI is missing initiative-level context explaining why the change matters.
We split incoming runs for two weeks: half got initiative-level specs added to the prompt, half ran with story context only, both tagged for comparison. The result was clean.

One-shot success on the enriched group hits 52%. The control stays at 31%. Initiative context is load-bearing for this repository, and it becomes the default. That’s a question that would normally be settled by opinion. Instead, it’s a number.
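Whether a split like that clears chance is itself a computation. A pooled two-proportion z-test, stdlib only; the arm sizes here (200 runs each) are hypothetical, not from the experiment.

```python
from math import sqrt, erfc

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z, erfc(abs(z) / sqrt(2))

# 31% control vs 52% enriched at 200 runs per arm
z, p_value = two_proportion_z(62, 200, 104, 200)
```

At those volumes the lift is unambiguous (p well under 0.001); at a few dozen runs per arm the same percentages would be far less clear-cut.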
The scaffolding value discovery. One team: 35% one-shot success, 37% eventual autonomous success. A nearly nonexistent gap. Pulling the attempt chain data and diffing attempt-to-attempt reveals the problem: the AI makes the same mistake on every retry. It’s not learning from CI failures because the test suite produces generic error messages: “Test failed” with a stack trace and no description of what the test was checking. The AI can’t diagnose what went wrong, so each retry is a random swing.
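That diagnosis comes straight out of the attempt-chain data. A sketch of the check, assuming each chain stores its attempts’ diffs in order:

```python
def repeats_itself(diffs):
    """True if any retry reproduces the previous attempt's diff verbatim:
    the signature of a self-correction loop that isn't learning."""
    return any(prev == cur for prev, cur in zip(diffs, diffs[1:]))
```

A real check would fuzzy-match near-identical diffs rather than demand exact equality, but exact repeats alone are enough to flag a broken feedback loop.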
After updating the test output formatting, eventual autonomous success on that team rises to 64%. The intervention isn’t a model change or a context change. It’s making the test suite a better teacher.
The model swap. Running both a frontier model and an open-weight model on the same tasks for a full sprint, stratified by complexity, shows the headline gap (65% vs. 48% one-shot success) is almost entirely concentrated in complex and architectural changes. On standard-complexity work, they’re nearly identical: 63% vs. 59%. That’s a routing rule, not a preference: open-weight for standard complexity, frontier for the rest. And the failure classifications on the open-weight misses point directly at Phase 3 fine-tuning targets.
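The routing rule falls out of the stratified table mechanically. A sketch, taking measured one-shot rates as input; the tolerance for when the cheaper model is “close enough”, and the per-stratum rates below, are assumptions for illustration.

```python
def build_routing(rates, tolerance=0.05):
    """rates: {complexity: {"frontier": rate, "open_weight": rate}}.
    Prefer the open-weight model wherever its one-shot success rate
    is within `tolerance` of the frontier model's."""
    return {
        complexity: ("open_weight"
                     if r["frontier"] - r["open_weight"] <= tolerance
                     else "frontier")
        for complexity, r in rates.items()
    }

routing = build_routing({
    "standard":      {"frontier": 0.63, "open_weight": 0.59},  # from the sprint data
    "complex":       {"frontier": 0.65, "open_weight": 0.48},  # illustrative stratum rates
    "architectural": {"frontier": 0.65, "open_weight": 0.48},
})
```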
The hidden bottleneck. One-shot success is climbing week over week, sitting at 55% and trending up. The factory looks fine. But MLTC hasn’t moved. Cycle time has doubled. Engineers are doing exhaustive line-by-line review on AI-generated PRs while human-authored PRs of the same complexity get a quick scan. The factory improved. The organisation didn’t adapt.

The fix is process and trust, not engineering. A tiered review policy (AI PRs that pass all CI checks and match known patterns get an expedited track) drops cycle time by 40% in the first month. You only see this if you’re watching MLTC alongside one-shot success.
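The tiered policy itself is a small predicate. A sketch with illustrative field names:

```python
def review_track(pr):
    """Route a PR to a review track. Expedited: AI-generated, all CI
    checks green, and the change matches a known pattern."""
    if pr["ai_generated"] and pr["ci_passed"] and pr["matches_known_pattern"]:
        return "expedited"
    return "standard"
```

The point isn’t the code; it’s that the policy is explicit and auditable, so trust can be extended deliberately rather than withheld by default.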
The counterargument: instrumentation is overhead
The real objection isn’t “measurement is bad.” It’s “we don’t have time to build this before we need to ship.” That’s fair in organisations where AI coding is touching a handful of PRs per week.
At scale it flips. With thousands of pull requests moving through a system each week, statistically meaningful datasets accumulate in days. A two-week experiment on context enrichment is enough to detect a 20-point lift with confidence. You’re not theorising. You’re measuring.
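That arithmetic is checkable with the standard sample-size approximation for a two-proportion test. A sketch, using the usual defaults (alpha 0.05 two-sided, 80% power):

```python
from math import ceil

def n_per_arm(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Approximate runs per arm needed to detect a lift from p1 to p2
    (defaults: alpha=0.05 two-sided, 80% power)."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
```

Detecting a 20-point lift from a 31% baseline needs roughly 90 runs per arm. At thousands of PRs a week that’s a day or two of traffic; at 50 PRs a week it’s most of a month.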
The telemetry schema you build for Phase 1 is the same data that enables Phase 2 (model comparison and open-weight substitution) and Phase 3 (fine-tuning). Skipping it in Phase 1 means reconstructing it later at higher cost, with less history behind it.
The sequencing is the whole point. Build the telemetry before you pick a framework, commit to a model, or standardise on a prompting pattern. You don’t need to have solved the pipeline first. Just capture what it’s doing: one row per attempt, across the seven stages.
From there, the decisions are driven by data you actually have. Which context inputs move one-shot success for this codebase? How much is scaffolding contributing versus hurting? Where is the complexity threshold where the open-weight model starts to underperform?
Teams that instrument early find the context enrichments that work, catch scaffolding failures before they compound, and see hidden bottlenecks before they quietly erase months of improvement. Teams that don’t are optimising by feel, in a domain where the data is right there if you build the pipes to collect it.
Build the dashboard first. The factory will tell you what it needs.
The CRO analogy for AI coding is seductive. Map your pipeline to a funnel, instrument every stage, run controlled experiments, let the data decide. Looks great on a slide.
Before you commit to it: is your AI coding pipeline actually a funnel?
What a funnel assumes
CRO works when you have a well-defined conversion event, a repeatable process, and enough volume to detect signal. Web funnels check all three: millions of users, identical flows, clear conversion definitions.
AI code generation seems to check all three. The conversion event is clear enough: context in, shipped code out. The process is reasonably repeatable. Volume, at scale, is real. But the funnel model rests on a fourth assumption, one the list leaves implicit.
The failure is in the assumption that drop-off points are independent. In a web funnel, someone who abandons at checkout doesn’t affect someone who completes the purchase. In an AI coding pipeline, a bad context input at stage B doesn’t just affect that run. It can corrupt the self-correction loop at stage C, inflate verification failure rates at stage D, and produce a misleading signal at stage G. The stages aren’t independent. The funnel metaphor hides that.
The instrumentation is still worth doing
This isn’t an argument against measurement. The seven-stage funnel and its telemetry schema are genuinely useful. The four derived metrics (one-shot success, eventual autonomous success, cycle time, MLTC) capture things that matter. The hidden bottleneck example (rising one-shot success, flat MLTC, doubled cycle time) is a real failure mode that measurement catches.
The instruments are good. The question is whether CRO is the right frame for interpreting what they find.

Where the analogy breaks down
Culture isn’t a funnel stage. The hidden bottleneck example ends with a process intervention (a tiered review policy), not a technical one. That intervention only worked because leadership could change how engineers review AI PRs. In organisations where trust in AI output is low, where senior engineers treat automated checks with scepticism, or where the review culture is deeply embedded, no amount of cycle time data changes behaviour. The bottleneck is organisational. Measuring it doesn’t fix it.
The CRO framing implies that if you can measure the problem, you can fix it. True for context enrichment experiments. Not true for cultural resistance to AI-generated code.
Controlled experiments require genuine control. The context enrichment experiment gives a clean number: 31% to 52% one-shot success with initiative-level specs added. But the experiment assumes the enriched and control groups receive comparable tasks. In a real pipeline, task complexity, the team handling the work, and the repository all shift the baseline. The direction is probably right. The 21-point figure is an estimate, not a measurement.

Metrics tell you that something is wrong, not what. One-shot success and MLTC diverging signals a bottleneck downstream. It doesn’t tell you it’s a trust problem. That diagnosis requires talking to the engineers, qualitative work that no dashboard replaces.

The volume assumption doesn’t scale down
The framework works at high volume. With 3,500 pull requests per week, two weeks of data is meaningful. Signal accumulates fast and the telemetry value compounds.
Most teams aren’t there. At 50 PRs per week, or 200, the confidence intervals on a context enrichment experiment are wide. You might wait months for a result you could have gotten from a senior engineer reading 30 failure cases.
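The width of those intervals is easy to make concrete. A normal-approximation sketch:

```python
from math import sqrt

def ci_halfwidth(p, n, z=1.96):
    """Approximate 95% confidence interval half-width for an observed
    success rate p over n runs."""
    return z * sqrt(p * (1 - p) / n)
```

At 50 runs and a 40% observed rate the interval spans roughly ±14 points, wider than most of the effects you’d hope to detect. At 1,000 runs it tightens to about ±3.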
The instrumentation-first approach is right for organisations operating at scale. For smaller teams, the overhead-to-insight ratio is worse than the framework implies, and qualitative diagnosis might get you there faster.

What to take from this
Pull out the CRO framing and the core advice still holds: measure what’s happening before you commit to a toolchain. The telemetry schema is well-designed. The four metrics are a sensible set. The patterns instrumentation finds (broken test feedback loops, model routing rules, hidden review bottlenecks) are real.
The reservation is about epistemics, not measurement. A funnel model assumes problems are localised and show up in aggregate metrics. Some do. The ones rooted in culture and trust don’t. Those need a different kind of attention.
Build the telemetry. Run the experiments. But stay sceptical about what the numbers can tell you, and what they can’t.
If your team is adopting agentic coding without instrumenting the pipeline, you’re already behind the teams that are. Not because you picked the wrong framework or model. Because you have no way to find out whether you did.
The gap isn’t technical capability. It’s data. Teams that instrument their AI coding pipelines accumulate something the others don’t: a dataset that tells them exactly what’s working, in their codebase, with their specific mix of complexity and context. Teams that skip instrumentation run the same experiments and throw away the results.
The factory without a dashboard is a black box
Every opinion-based AI coding strategy has the same failure mode: when results disappoint, you don’t know why. More context, different model, tighter prompts: you’re adjusting knobs in the dark. Sometimes something helps. You still don’t know why.
This isn’t defensible when you’re processing thousands of AI-generated pull requests. At that volume, every week without instrumentation is a week of signal discarded.
Treat the factory as a funnel. Instrument every stage. The data will tell you what the opinions won’t.
The CRO analogy is exact
Every AI coding run is a conversion event. Context goes in. Shipped code comes out. In between, there are seven measurable stages where the run succeeds, fails, or requires human intervention: (A) Change Identity, (B) Context Inputs, (C) AI Execution, (D) Verification, (E) Review and Human Intervention, (F) Deployment and Outcomes, (G) Derived Metrics.
This is CRO applied to code generation. The same measurement discipline that lets marketing teams optimise funnels applies directly to your AI pipeline, and teams that have already built CRO competency are in a strong position to apply it here.

At each stage, runs drop out. Without instrumentation, you’re looking at aggregate success rates and guessing at causes. With it, you know exactly where your factory leaks and what a targeted fix is worth.
The four metrics that matter
Four numbers give you a complete picture of factory health. None is optional.
One-shot success rate: did the AI’s first attempt pass CI and merge without human modification? Your north star. Also gameable in isolation, which is exactly why you need the others.
Eventual autonomous success: did AI output eventually merge without human code edits, across any number of loops? The gap between this and one-shot success measures scaffolding value. A 35%/37% split means the loops are doing nothing. Broken feedback loop, not a model problem.
Cycle time: PR open to merge. One-shot success climbing but cycle time also climbing: engineers are reviewing AI output more carefully than it deserves. The bottleneck has moved downstream.
MLTC: commit to deploy. The honesty check. Rising one-shot success with flat MLTC means something downstream is absorbing the gains. You won’t see it without MLTC.

What the data proves
These aren’t hypotheticals. They’re what shows up within the first weeks of operating an instrumented factory.
Context is load-bearing, and measurable. Half the incoming runs get initiative-level specs added to the prompt, half run with story context only. Two weeks of data.

One-shot success jumps from 31% to 52% on the enriched group. Twenty-one points from a context change that took an afternoon to design and two weeks to confirm. Not an edge case. This is what instrumentation finds routinely.
Broken scaffolding is invisible without attempt chains. One team: 35% one-shot success, 37% eventual autonomous. The loops are doing nothing. Pull the attempt chain data and the pattern is clear: the AI makes the same mistake on every retry because the test suite produces generic error messages. Fix the test output formatting, not the model or the prompt. Without attempt-to-attempt diff analysis, no one finds this.
Model routing is a data problem, not a debate. Run both a frontier and an open-weight model on the same tasks for a sprint, stratified by complexity. The headline gap (65% vs. 48% one-shot success) collapses to near-zero on standard-complexity work. The open-weight model underperforms only on complex and architectural changes. That’s a routing rule, not a preference. And the failure classifications on the misses are a precise target for fine-tuning.
Hidden bottlenecks are invisible without MLTC. One-shot success rising week over week. Factory looks good. But MLTC hasn’t moved. Cycle time has doubled.

Engineers are reviewing AI-generated PRs exhaustively even when the code is fine. The factory improved. The organisation didn’t. A tiered review policy (expedited track for AI PRs that pass all CI checks and match known patterns) drops cycle time by 40% in one month. Only if you’re watching MLTC.
Instrumentation is not overhead
Teams for whom instrumentation is genuinely expensive are teams running AI coding at very low volume. At thousands of pull requests per week, statistically meaningful results accumulate in days. A two-week experiment confirms a 20-point lift. The telemetry cost is trivial relative to running experiments you can’t interpret.
The instrumentation you build for Phase 1 is also the data you need for Phase 2 model substitution and Phase 3 fine-tuning. Build it once, use it across every phase. Skip it now and you’ll rebuild it later at higher cost, with less history behind it.
Build the telemetry first
Don’t pick a framework. Don’t commit to a model. Don’t standardise on a prompting pattern. Build the instrumentation first: one row per attempt, across the seven funnel stages. Then let the data drive the decisions.
The teams not doing this are optimising by feel in a domain that rewards precision. The gap compounds every week.
Build the dashboard first. The factory will tell you what it needs.