Most code reviews are theater. One reviewer skims the diff, leaves a few nitpicks about naming conventions, and approves. The architectural misalignments, the security assumptions nobody questioned, the reliability gaps that only show under load: those slip through because one person with one context window can only hold so much of the system in their head at once.
Claude Code’s Agent Teams feature changes the economics of this problem. Instead of one reviewer, you spin up four. Each gets an independent context window. Each carries a different bias. Then you tell them to argue.
What Agent Teams Actually Changes
Agent Teams is an experimental Claude Code capability (Opus 4.6+, research preview) that works differently from sub-agents. Sub-agents run inside a single session. They report to a parent and can’t communicate with each other. Agent Teams spawns multiple independent Claude Code sessions, each with its own full context window, connected through a shared task list and a direct messaging channel.
One session acts as team lead. The rest are teammates. Every teammate loads your project’s CLAUDE.md, MCP servers, and skills, but inherits nothing from the lead’s conversation history. The only shared state is task files on disk and direct messages.

This isolation is the point. Independent context windows mean independent reasoning. When four agents review the same code, they aren’t sharing attention or priming each other’s conclusions. Each one builds its own model of the system from scratch.
The Speed Pattern: Contract-First Parallel Builds
The straightforward use case is parallelizing feature work across layers that don’t share files. You describe the team to the lead, it spawns teammates, and tasks execute in dependency-tracked waves. Independent work runs concurrently. Dependent tasks unblock when prerequisites complete.
A clean setup: three teammates building a “saved search” feature. One writes the OpenAPI contract (the synchronization point). One builds the backend Lambda and DynamoDB resources. One builds the frontend components. The contract teammate finishes first, unblocking the other two to work in parallel with zero file overlap.
Good decomposition has three properties: independent domains, clear file boundaries, and no shared-state conflicts. Same-file edits and tight sequential chains don’t belong in a team. They belong in a single session with sub-agents.
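The wave mechanics are easy to sketch. A minimal illustration, not Claude Code’s actual scheduler, of how a dependency-tracked task graph resolves into concurrent waves (the task names are the hypothetical “saved search” breakdown from above):

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
tasks = {
    "openapi-contract": set(),                 # the synchronization point
    "backend-lambda":   {"openapi-contract"},  # unblocked once the contract lands
    "frontend-ui":      {"openapi-contract"},  # runs in parallel with the backend
}

sorter = TopologicalSorter(tasks)
sorter.prepare()

waves = []
while sorter.is_active():
    ready = list(sorter.get_ready())  # everything unblocked right now
    waves.append(sorted(ready))       # one concurrent wave
    sorter.done(*ready)

print(waves)
# Wave 1 is the contract alone; wave 2 is backend and frontend in parallel.
```

Same-file edits would show up here as two tasks needing a lock rather than an edge, which is exactly why they belong in a single session instead.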
The Quality Pattern: Reviewers That Push Back
The more interesting pattern trades speed for depth. Instead of splitting a feature across teammates, you point all of them at the same code from different angles.
Four reviewers, each with a specific bias:
- Security: assume the code is hostile until proven otherwise
- Reliability: assume every dependency will fail
- Cost and performance: assume traffic at 100x current load
- Maintainability: assume a new engineer owns this in six months
The part that makes this work isn’t the role definitions. It’s the adversarial coordination clause. After initial findings, every teammate reads the others’ reports. Any finding rated “critical” by one and “non-issue” by another triggers a debate round via direct message. Teammates are explicitly encouraged to push back.

Debate ends on consensus or after two rounds with no movement. When there’s no convergence, the lead records both positions. That fallback prevents premature consensus, which is the exact failure mode you’re guarding against.
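The coordination rule above can be modeled as a small loop. This is a simplified sketch, not the product’s actual protocol; the severity labels and the `toy_debate` stand-in for a direct-message exchange are invented:

```python
def resolve(finding, ratings, debate_round, max_rounds=2):
    """Debate a finding until consensus, or stop after max_rounds with no movement.

    `ratings` maps reviewer name -> severity label. `debate_round` stands in
    for a DM exchange between reviewers and returns updated ratings.
    """
    for _ in range(max_rounds):
        if len(set(ratings.values())) == 1:
            break                                # consensus reached
        new = debate_round(finding, dict(ratings))
        if new == ratings:
            break                                # no movement: stop early
        ratings = new
    if len(set(ratings.values())) == 1:
        return {"finding": finding, "consensus": next(iter(ratings.values()))}
    # No convergence: the lead records both positions.
    return {"finding": finding, "positions": dict(ratings)}

# Toy debate: each side's justification moves the other toward "moderate"
# (a stand-in for the real back-and-forth over direct messages).
def toy_debate(finding, ratings):
    return {name: "moderate" for name in ratings}

result = resolve(
    "missing input validation",
    {"security": "critical", "reliability": "non-issue"},
    toy_debate,
)
print(result)  # {'finding': 'missing input validation', 'consensus': 'moderate'}
```

The `positions` branch is the fallback that prevents premature consensus: disagreement survives into the report instead of being averaged away.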
What Debate Rounds Actually Measure
The most useful metric from an adversarial review is whether debate rounds changed any finding’s severity. A security reviewer flags missing input validation as critical. The reliability reviewer calls it a non-issue because the Lambda’s IAM role already limits blast radius. They argue. The finding moves from critical to moderate with better justification, or it stays critical because the security reasoning held up under challenge.
That severity delta is a clean proxy for whether adversarial configuration produces different outputs compared to harmonious rubber-stamping. If your four-agent review generates the same findings a single agent would have found, the debate clause isn’t earning its tokens.
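That proxy is trivial to compute. A sketch, assuming each review emits findings keyed by id with a severity label (the schema is invented, not an Agent Teams output format):

```python
def severity_delta(pre, post):
    """Fraction of findings whose severity changed across debate rounds.

    `pre` and `post` map finding id -> severity label. A delta near zero
    over many reviews suggests the debate clause isn't earning its tokens.
    """
    if not pre:
        return 0.0
    changed = [k for k in pre if post.get(k, pre[k]) != pre[k]]
    return len(changed) / len(pre)

pre  = {"F1": "critical", "F2": "low", "F3": "moderate"}
post = {"F1": "moderate", "F2": "low", "F3": "moderate"}
print(severity_delta(pre, post))  # one of three findings moved: 0.333...
```

Tracking this per review run gives you a longitudinal answer to the question, rather than a one-off impression.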
Anthropic’s own stress test puts a number on what the coordination model can handle: 16 agents built a Rust C compiler across roughly 2,000 sessions at approximately $20,000 in API costs. An extreme case, but it proves the architecture scales to workloads far beyond a typical team review.
The Cost Nobody Wants to Discuss
Agent Teams run at roughly 3 to 4 times the tokens of a single session doing the same work sequentially. Every teammate gets a full context window. Every message between them costs tokens. The lead’s orchestration overhead is real.

Practical mitigations exist. Plan in single-session mode first (cheap), review the decomposition, then hand off to the team. Start with three teammates, not eight. Know that session resumption and shutdown are both flaky in the current preview, so don’t design workflows that depend on pausing mid-run.
But the cost question misses the point if you frame it as “4x tokens for the same output.” The bet is that four independent context windows produce different output. Better output. The adversarial debate pattern is the mechanism. The severity delta is the measurement. If the numbers show the output is identical, stop paying for teams. If they show genuine disagreement being surfaced and resolved, the 4x multiplier is buying something a single agent cannot produce at any token budget.
The experiment is live. The measurement is straightforward: run a review with one agent, run the same review with four adversarial agents, compare the findings. The only cost of running that experiment is tokens, and tokens get cheaper every quarter.
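The comparison itself is mechanical once both runs emit findings. A sketch of the overlap measurement (finding ids and the two example sets are invented; the column worth staring at is what the team found that the single agent missed):

```python
def compare_reviews(single, team):
    """Partition findings from a single-agent review and a team review."""
    single, team = set(single), set(team)
    return {
        "both": sorted(single & team),
        "single_only": sorted(single - team),
        "team_only": sorted(team - single),  # what the 4x tokens bought, if anything
    }

single_agent = {"sql-injection", "missing-retries"}
agent_team   = {"sql-injection", "missing-retries", "unbounded-scan-cost"}
print(compare_reviews(single_agent, agent_team))
```

An empty `team_only` list across several runs is the signal to stop paying the multiplier.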
The pitch for Agent Teams sounds compelling: spin up four AI reviewers instead of one, give them different biases, make them argue. Better output through structured disagreement. But the gap between that promise and the current reality deserves more scrutiny than it’s getting.
“Independent Context” Is Not Independent Thinking
The architectural distinction between sub-agents and Agent Teams is real. Teammates get separate context windows. They don’t inherit each other’s conversation history. But they all load the same CLAUDE.md, the same MCP servers, the same skills. They run the same model with the same training data.

Independent context windows produce independent attention patterns, not independent worldviews. A security reviewer and a reliability reviewer will see different surface-level issues, but their deeper assumptions about code quality, architecture, and risk come from the same model weights. Whether “different bias prompts on the same foundation model” generates genuinely orthogonal analysis or just slightly different angles on one underlying perspective is a question that doesn’t have a public answer yet.
The Decomposition Problem Is Harder Than It Looks
The speed pattern (parallel builds across non-overlapping files) works when the decomposition is clean. Real codebases resist clean decomposition. Shared state lives in databases, message queues, environment variables, and implicit contracts between modules. File boundaries don’t map neatly to conceptual boundaries.
The prescription is strict: independent domains, clear file boundaries, no shared-state conflicts. That reads like a constraint on the code, not on the tool. If your codebase already has clean domain boundaries, you could probably parallelize with simpler tooling. If it doesn’t, Agent Teams won’t fix the underlying coupling. It will just let you discover the coupling failures in parallel.
Adversarial Debate or Adversarial Theater?
The debate clause is the selling point of the quality pattern. But “debate” between four instances of the same model has a constraint that rarely gets mentioned: they share the same loss function. When a security reviewer and a reliability reviewer argue about severity, the debate resolves through the same underlying model’s sense of what a reasonable conclusion looks like.

Real adversarial review in human teams works because the participants bring genuinely different experiences, risk tolerances, and organizational knowledge. A senior SRE who lived through a production outage last quarter reasons differently from a security engineer who just finished an audit. The disagreement is grounded in distinct histories, not just different checklists.
AI “disagreement” is grounded in prompt engineering. That might be enough to surface some findings that a single pass misses. But calling it “adversarial” overstates what’s happening. It’s closer to the same reviewer doing four passes with different checklists. That has value, but it’s not the paradigm shift the framing implies.
The Severity Delta Might Not Measure What You Think
The proposed metric (whether debate rounds change any finding’s severity) is clean. But it conflates two things: the debate producing genuinely new insight, and the debate redistributing confidence across the same conclusions.
If a security reviewer flags missing input validation as critical and a reliability reviewer downgrades it to moderate because IAM limits the blast radius, did the debate produce new information? Or did the model’s tendency toward consensus just find a middle position? You’d need a human expert evaluating the pre-debate and post-debate findings to tell the difference. At that point you’re back to the manual review you were trying to replace.
4x Tokens for an Unproven Multiplier
Three to four times the tokens is a concrete cost. The benefit is speculative until someone publishes controlled comparisons: same codebase, same review scope, single agent versus adversarial team, findings evaluated by human experts.

Anthropic’s C compiler stress test (16 agents, ~$20,000) proves the coordination model doesn’t collapse under load. It does not prove the output is better than what a careful single-agent workflow would produce with the same token budget spent on longer context, more iterations, or chain-of-thought reasoning.
The honest recommendation right now: run the experiment on code where you have expert human reviewers who can evaluate the findings independently. Compare the agent team output to the single-agent output and to the human output. If the team finds things neither the single agent nor the human caught, that’s a real result. Until that data exists, the 4x cost is a bet on a hypothesis, not a proven trade.
Session resumption is flaky. Shutdown leaves orphaned processes. Those are fixable engineering problems. The deeper question (whether same-model “debate” generates genuinely different analysis) is harder to answer and matters more.
Single-agent code reviews are broken and everyone knows it. One context window, one perspective, one pass through the diff. The result is predictable: surface-level nitpicks get caught, structural problems sail through, and nobody discovers the gap until production does.
Agent Teams fixes this by making disagreement a first-class operation. Four independent Claude Code sessions, four different biases, and an explicit rule: if you disagree on severity, you argue it out in writing until one side concedes or both positions go on record.
Independent Context Windows Are the Whole Point
Sub-agents share a parent’s context. They see what the parent saw, inherit its assumptions, and reinforce its blind spots. Agent Teams sessions start clean. Each teammate loads the project’s CLAUDE.md and nothing else from the lead’s conversation.

This isn’t a minor implementation detail. It’s the reason the feature exists. Independent reasoning requires independent context. Anything less is one opinion wearing four hats.
Speed: Parallel Builds That Actually Parallelize
The speed pattern works when you get decomposition right. Split a feature across teammates with zero file overlap. Use a contract (OpenAPI spec, shared schema, interface definition) as the synchronization point. The contract teammate finishes first. The implementation teammates parallelize from there.
Teams already doing this are compressing multi-day feature builds into hours. The rules are non-negotiable: independent domains, clear file boundaries, no shared-state conflicts. Violate them and you get coordination overhead without the parallelism payoff.
Anthropic ran 16 agents across roughly 2,000 sessions building a Rust C compiler. The coordination model held. The architecture scales to workloads far larger than most teams will attempt.
Quality: The Debate Clause Changes the Output
Four reviewers examining the same code through different lenses (security, reliability, cost, maintainability) will surface different findings. Expected. What changes the outcome is the adversarial coordination rule: conflicting severity ratings trigger mandatory debate rounds via direct message.

A security reviewer flags missing input validation as critical. A reliability reviewer calls it a non-issue because IAM constraints already limit the blast radius. Without the debate clause, both findings land in the report at their original severity and the human picks one. With the clause, the agents confront each other’s reasoning before the human sees anything.
Findings that survive a debate round carry the reasoning that withstood direct challenge. They are qualitatively different from uncontested assertions. The severity delta between pre-debate and post-debate findings is your measurement: if debate changes nothing, the agents agree and you’ve confirmed the findings. If debate changes severity ratings, you’ve found problems that a single-agent review would have missed or misjudged.
The 4x Cost Is the Wrong Frame
Agent Teams cost 3 to 4 times the tokens of a single session. Every teammate gets a full context window. Every message adds to the bill.

The comparison people keep making is “4x tokens for the same review.” That’s wrong. The real comparison is 4x tokens for a review that surfaces disagreements a single agent structurally cannot find. One context window holds one perspective. Four context windows hold four. The token cost buys independent reasoning, the only thing that catches what your single-pass reviewer keeps missing.
Mitigate the spend: plan in single-session mode first, review the decomposition, then create the team. Start with three teammates. Accept the preview limitations (session resumption and shutdown are both rough) and don’t build workflows that depend on pausing mid-run.
Tokens get cheaper every quarter. The bugs that reach production because nobody pushed back on the review don’t.