The 10% productivity plateau from unstructured AI use raises an obvious question. If the problem is that AI tools get inconsistent results from inconsistent inputs (ad hoc prompts, unclear requirements, missing context), what happens when you fix the input?
Spec-driven development (SDD) is the most serious attempt to answer that question. The core idea: treat the specification, not the code, as the primary artifact. Write a detailed, structured spec first. Let the AI implement from it. Version-control the spec alongside (or instead of) the code.
The Thoughtworks Technology Radar called it “one of the most important practices to emerge in 2025.” GitHub Spec Kit accumulated 72,000+ stars in roughly six months. AWS has shipped a commercial IDE built around the concept. This is moving fast.
What spec-driven development actually means
SDD is not a single workflow. It is a spectrum of practices defined by how much authority the spec has over the code.
Spec-first is the most common form today. The developer writes a detailed specification, the AI implements it, and the developer reviews and iterates. The spec guides generation. GitHub Spec Kit and AWS Kiro operate here.
Spec-anchored goes further: the spec both guides generation and validates the output, with automated checks verifying that the AI-generated code satisfies the spec’s constraints. This is closer to contract-driven or BDD-style workflows applied to AI-assisted development.
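What a spec-anchored check could look like in practice: the sketch below assumes acceptance criteria are stored as machine-readable entries in the spec and executed against the generated code in CI. The spec format, the check_spec helper, and the slugify example are all hypothetical illustrations, not any framework's API.

```python
# Illustrative spec-anchored check: acceptance criteria live in the spec as
# machine-readable entries and are run against the generated implementation.
# All names and the JSON layout here are hypothetical, not a real framework's API.
import json

SPEC = """
{"criteria": [
  {"call": "slugify", "args": ["Hello, World!"], "expect": "hello-world"},
  {"call": "slugify", "args": [""], "expect": ""}
]}
"""

def slugify(text: str) -> str:
    # Stand-in for the AI-generated code under validation.
    return "-".join("".join(c.lower() if c.isalnum() else " " for c in text).split())

def check_spec(spec_json: str, namespace: dict) -> list[str]:
    """Return human-readable failures; an empty list means the code satisfies the spec."""
    failures = []
    for c in json.loads(spec_json)["criteria"]:
        got = namespace[c["call"]](*c["args"])
        if got != c["expect"]:
            failures.append(f"{c['call']}{tuple(c['args'])}: expected {c['expect']!r}, got {got!r}")
    return failures

failures = check_spec(SPEC, {"slugify": slugify})
print("spec satisfied" if not failures else failures)
```

The mechanism, not the toy function, is the point: once criteria are data rather than prose, the spec can reject an implementation instead of merely inspiring one.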
Spec-as-source is the radical end. Code becomes disposable output; only the spec is maintained. If you need to change behaviour, you change the spec and regenerate. Tessl is exploring this in private beta. At scale, it remains largely theoretical.
For most teams today, “spec-driven” means spec-first: structured markdown documents that give AI agents enough context to produce consistent, reviewable output.
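A minimal illustration of such a document (the section names are generic, not any particular framework's schema):

```markdown
# Spec: Password reset endpoint

## Requirements
- POST /auth/reset accepts an email address and always returns 202.
- Reset tokens expire after 30 minutes and are single-use.

## Constraints
- No new dependencies; reuse the existing mailer service.

## Acceptance criteria
- Unknown emails also return 202 (no account enumeration).
- A consumed or expired token returns 410.
```

The point is not the format but the explicitness: requirements, constraints, and acceptance criteria that both the agent and the reviewer can check the output against.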

The four frameworks
The SDD ecosystem has consolidated around four main approaches, each with a distinct philosophy.
GitHub Spec Kit (72K+ stars) is the de facto standard. Its four-phase workflow (Specify, Plan, Tasks, Implement) is a lightweight iterative loop that stays close to agile practice. It is agent-agnostic, supporting 22+ AI platforms, with a markdown-first philosophy: low overhead, high portability. Best for small to medium teams who want to move fast without adopting a heavy process.
BMAD Method is the enterprise end: 21 specialised AI agents and 50+ guided workflows simulating a full agile team, with AI personas for product manager, architect, developer, and QA. Best for large greenfield projects where governance and role clarity matter. The risk Thoughtworks flags is real: that much process can start to feel like waterfall.
OpenSpec goes the other direction and is designed specifically for brownfield environments. Rather than imposing a greenfield process on an existing codebase, it uses change-centric plain markdown specs: you write a spec for each change, not a full system spec. Best for teams maintaining legacy codebases who cannot start from scratch.
AWS Kiro is the commercial entrant. At $20/month, it bakes SDD directly into an IDE: natural language input becomes user stories, acceptance criteria, design documents, and implementation tasks automatically. Best for AWS-integrated enterprise teams who want the approach without the setup cost. The trade-off is lock-in to both the IDE and AWS’s interpretation of the workflow.

The case for SDD
The logical argument for spec-driven development is strong. It addresses the specific failure modes that the research has documented.
GitClear found code churn rising to 5.7% and duplicate code growing 4x as AI adoption increased. The proposed mechanism: AI tools without clear specs are guessing at intent, producing code that frequently needs revision or duplication. A well-written spec removes the ambiguity that causes guessing.
DORA identified version-controlled audit trails as a success factor for high-performing teams. Spec files in version control create exactly that: a record of intent that can be reviewed, diffed, and referenced in post-mortems.
The Specify, Plan, Tasks, Implement workflow enforces incremental delivery, preventing the big-batch failures that DORA associates with AI-driven instability.
And the market signal is real: 72,000 GitHub stars in six months, enterprise adoption via AWS Kiro, Thoughtworks’s endorsement. When this many senior practitioners converge on a practice, it is worth taking seriously.

The honest assessment
The logical case is compelling. The empirical case does not yet exist.
No peer-reviewed, independently conducted study has quantified spec-driven development’s impact on productivity, code quality, or delivery speed. The claims of “10x performance improvement” circulating in the community are self-reported and often vendor-adjacent. They should be treated the same way the METR researchers treated vendor productivity claims: as hypotheses, not evidence.
Thoughtworks itself, while endorsing the practice, flagged a real risk: SDD’s emphasis on upfront specification can slide back into waterfall antipatterns, with heavy documentation requirements, big-bang releases, and the accumulated coordination costs that the agile movement spent two decades escaping. The lightweight, iterative end of the spectrum (Spec Kit, OpenSpec) is designed to avoid this. The heavyweight end (BMAD, with 21 agents and 50+ workflows) risks it.
The parallel to agile is instructive. The logic of iterative, feedback-driven development was sound in 2001. But how teams implemented it varied enormously, and the failures were usually process failures, not conceptual ones. SDD faces the same test. The idea is right. Execution will determine whether it delivers on the premise.
What to watch
SDD is where the live methodological questions in AI-assisted development are being worked out in 2026. The questions that the next 12 to 18 months of data should answer:
Does spec quality correlate with output quality in measurable ways? Can you demonstrate, with objective metrics, that teams using structured specs produce lower churn rates and fewer incidents than teams using ad hoc prompting?
Does SDD actually avoid the waterfall trap in practice, or does the discipline required to maintain good specs become its own kind of overhead?
And most importantly: at what team size and codebase complexity does SDD deliver enough value to justify the investment in process?
Apply the same scepticism here that the research demands everywhere else. Demand measured outcomes. The idea is promising. The evidence is still being collected.
Sources: Thoughtworks Technology Radar (2025). Piskala, arXiv (2026). GitHub Spec Kit. BMAD-METHOD. Fission-AI/OpenSpec. AWS Kiro. GitClear AI Copilot Code Quality Research. DORA 2025.
The 10% productivity plateau from unstructured AI use raises an obvious question. If the problem is that AI tools get inconsistent results from inconsistent inputs, what happens when you fix the input?
Spec-driven development claims to answer that question. But after 18 months of growing hype, the evidence base is almost nonexistent, and the historical pattern it most resembles is not encouraging.
No evidence after 18 months should be disqualifying
Let me be blunt about the state of the empirical case. There are zero peer-reviewed, independently conducted studies showing that spec-driven development improves productivity, code quality, or delivery speed. None.
The claims circulating in the community are self-reported. Many come from people who sell SDD tooling, write SDD content, or run SDD consultancies. “10x improvement” is the number that gets repeated. It is unverifiable and should be treated exactly the way the METR researchers treated vendor productivity claims: as marketing until proven otherwise.
This does not mean SDD cannot work. But 18 months is long enough for a practice with real impact to produce at least some independent measurement. The fact that it has not is a signal worth taking seriously.
GitHub stars measure curiosity, not outcomes
72,000 stars on GitHub Spec Kit. This gets cited as evidence of adoption and validation. It is neither.
GitHub stars measure curiosity and social proof. They are free to give, require no usage, and correlate weakly with actual adoption in production environments. Plenty of 50K+ star repositories represent ideas that people found interesting but never used seriously. Stars are a marketing metric. Treating them as evidence of effectiveness is the same category error as treating download counts as proof that a library works well.
AWS shipping Kiro is a stronger signal, but it tells you that AWS sees a market opportunity, not that the approach delivers measured results. AWS has shipped products that failed before. Commercial investment is a bet, not a validation.
The waterfall trap is not a minor risk
Thoughtworks flagged it, and I think they understated it. SDD’s emphasis on upfront specification is structurally similar to the heavyweight requirements processes that the agile movement spent two decades dismantling. The argument that “our specs are lightweight and iterative” is the same argument that every requirements-heavy process makes in its early years.
The BMAD Method, with 21 AI agents and 50+ guided workflows, is already at the heavyweight end. But even Spec Kit’s Specify, Plan, Tasks, Implement workflow imposes a phase-gate structure that adds friction. In a fast-moving codebase, maintaining the spec alongside the code becomes its own coordination cost. Specs drift. Code drifts. Keeping them synchronised requires discipline that, in my experience, erodes under delivery pressure.
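That synchronisation cost can be made visible, though not eliminated, with a simple CI guard. The sketch below is hypothetical: it assumes each spec records a hash of the source files it covers, and the build fails when those files change without a matching spec update.

```python
# Hypothetical CI guard: flag specs whose covered source files changed without
# a spec update. Assumes each spec records the SHA-256 of the files it covers;
# the layout is illustrative, not any framework's convention.
import hashlib

def digest(sources: dict[str, bytes]) -> str:
    """Hash file paths and contents in a stable order."""
    h = hashlib.sha256()
    for path in sorted(sources):
        h.update(path.encode())
        h.update(sources[path])
    return h.hexdigest()

def spec_is_stale(recorded: str, sources: dict[str, bytes]) -> bool:
    """True when the code has drifted since the spec's recorded digest."""
    return digest(sources) != recorded

# Example: the spec was written against version 1 of the file...
v1 = {"auth.py": b"def reset(): ..."}
recorded = digest(v1)
# ...then the code changed without a spec update.
v2 = {"auth.py": b"def reset(email): ..."}
print(spec_is_stale(recorded, v2))  # True: spec and code have drifted
```

A guard like this only detects drift; deciding whether the spec or the code is the one that is wrong remains a human judgement, which is exactly the discipline that erodes under delivery pressure.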
And here is the part that does not get discussed enough: the teams that struggle most with ad hoc prompting are the same teams that will struggle most with disciplined spec-writing. SDD assumes developers can write clear, structured specifications. Writing good specs is a skill. It requires understanding the problem domain well enough to articulate requirements precisely. The developers who can do that well are the same developers who were already getting decent results from ad hoc prompting. The ones who were not are unlikely to suddenly produce better outcomes because you gave them a template.
SDD transfers the problem without solving it
The core promise of SDD is: bad input produces bad output, so improve the input. This is logical. But it sidesteps the question of whether improving the input is actually easier than improving the output.
Ad hoc prompting produces inconsistent code. The SDD solution: write better prompts, structured as specs. But writing a good spec is at least as hard as writing good code, and arguably harder, because it requires you to think through requirements, edge cases, and constraints before any code exists. That is the hardest part of software development. It always has been.
SDD does not eliminate the hard problem. It moves it from “developer writes mediocre prompts and gets mediocre code” to “developer writes mediocre specs and gets mediocre code, but with more process overhead.” The bottleneck was never the format of the instruction. It was the quality of the thinking behind it.
What would change my mind
I would take SDD seriously if any of the following appeared: a controlled study comparing spec-driven and ad hoc AI workflows on equivalent tasks, with pre-registered methodology; longitudinal data from teams that adopted SDD showing measurable improvement in churn rates, incident rates, or delivery frequency; or independent analysis (not from framework authors or their communities) of SDD’s cost-benefit trade-off, including the overhead of spec maintenance.
Until then, SDD has the hallmarks of a practice that solves the problem people want solved (AI produces inconsistent code) without addressing the problem that actually exists (writing good software requirements is hard regardless of format). Apply the same scepticism here that the research demands everywhere else.
The 10% productivity plateau from unstructured AI use raises an obvious question. If the problem is that AI tools get inconsistent results from inconsistent inputs, what happens when you fix the input?
Spec-driven development answers that question. And the answer, both from first principles and from early practitioner experience, is compelling enough that I think SDD will become the default way serious teams use AI within two years.
SDD directly addresses every documented failure mode
This is what makes the logical case so strong. It is not a generic “better process” argument. SDD targets the specific problems that the research has identified.
GitClear found code churn rising to 5.7% and duplicate code growing 4x as AI adoption increased. The mechanism is clear: AI tools without specs are guessing at intent. Every guess that is wrong produces code that needs revision or gets duplicated. A well-written spec removes the guessing. You are not hoping the AI infers your intent from a chat prompt. You are stating your intent explicitly, in a structured document, with acceptance criteria.
DORA identified version-controlled audit trails as a success factor for high-performing teams using AI. Spec files in version control create exactly that. You can diff a spec, review a spec change in a PR, and reference a spec in a post-mortem. The intent is preserved alongside the implementation.
The Specify, Plan, Tasks, Implement workflow enforces incremental delivery, preventing the big-batch failures that DORA associates with AI-driven instability. You cannot skip ahead, because the workflow gates each phase.
This is not theoretical. Each of these connections maps directly to measured problems in the literature.
The market signal is a leading indicator
72,000 GitHub stars in six months. Enterprise adoption through AWS Kiro at $20/month. Thoughtworks endorsing it in the Technology Radar. These are not vanity metrics. They are senior practitioners voting with their time and their budgets.
I have watched enough technology adoption cycles to know what early convergence looks like. When independent groups (an open-source community, an enterprise vendor, and an analyst firm) arrive at the same conclusion within the same 12-month window, the underlying signal is almost always real. The details get refined, but the direction holds.
The empirical evidence is not here yet because the practice is 18 months old. Peer-reviewed studies take time. But waiting for perfect evidence before adopting a practice that addresses documented failure modes with a sound logical mechanism is not scepticism. It is inertia.
The 10% plateau is not a ceiling for spec-driven work
The productivity studies that show modest gains from AI are all measuring unstructured use: developers chatting with AI assistants, accepting autocomplete suggestions, using AI for boilerplate. These workflows do not give the AI enough context to do anything beyond local code generation.
SDD changes the input fundamentally. Instead of a prompt, the AI gets a specification with requirements, constraints, acceptance criteria, and architectural context. The output quality has to improve because the input quality is categorically different.
I think the “10x improvement” claims circulating in the community are premature and probably overstated. But the 10% plateau from ad hoc prompting is clearly not the upper bound for what structured AI development can achieve. The question is where the actual number lands, not whether it exceeds 10%.
The waterfall risk is real and manageable
Thoughtworks flagged it and they are right to. Any process that emphasises upfront specification can drift toward waterfall. But the lightweight end of the SDD spectrum is specifically designed to avoid this.
Spec Kit’s workflow is iterative by design. You write a spec, plan, decompose into tasks, implement, and iterate. The spec is a living document, not a 200-page requirements document that gets signed off and frozen. OpenSpec is even more change-centric, built for brownfield environments where you spec individual changes, not entire systems.
The heavyweight frameworks like BMAD carry more waterfall risk, and teams should be cautious there. But the dominant SDD tools are closer to agile than to waterfall. The risk exists, and it can be managed by choosing the right point on the spectrum.
Get started now
The teams that adopt SDD in 2026 will have 12 to 18 months of practice, tooling familiarity, and spec libraries by the time the empirical evidence confirms what practitioners already know: structured input produces structured output. The cost of starting is low: a markdown file and a workflow. The cost of waiting is falling behind teams that are compounding their spec-writing skill right now.