sdlcnext.com

Agent Teams: The 4x Token Bet That Ends Rubber-Stamp Reviews

Claude Code's Agent Teams spawn independent AI sessions that argue with each other. Four reviewers with adversarial debate clauses find what one reviewer misses, at 4x the token cost.


Viewpoint

Most code reviews are theater. One reviewer skims the diff, leaves a few nitpicks about naming conventions, and approves. The architectural misalignments, the security assumptions nobody questioned, the reliability gaps that only show under load: those slip through because one person with one context window can only hold so much of the system in their head at once.

Claude Code’s Agent Teams feature changes the economics of this problem. Instead of one reviewer, you spin up four. Each gets an independent context window. Each carries a different bias. Then you tell them to argue.

What Agent Teams Actually Changes

Agent Teams is an experimental Claude Code capability (Opus 4.6+, research preview) that works differently from sub-agents. Sub-agents run inside a single session. They report to a parent and can’t communicate with each other. Agent Teams spawn multiple independent Claude Code sessions, each with its own full context window, connected through a shared task list and a direct messaging channel.

One session acts as team lead. The rest are teammates. Every teammate loads your project’s CLAUDE.md, MCP servers, and skills, but inherits nothing from the lead’s conversation history. The only shared state is task files on disk and direct messages.
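
The coordination surface is small enough to sketch. Here is a toy model of that shared state, assuming a file-per-task layout and per-teammate inbox files; the directory names and JSON shapes are invented for illustration, not the real Agent Teams format:

```python
# Toy model of the only shared state between sessions: task files on disk
# plus a direct-message inbox per teammate. All paths and field names are
# invented for illustration.
import json
import tempfile
from pathlib import Path

TEAM_DIR = Path(tempfile.gettempdir()) / "agent-team-demo"

def write_task(task_id: str, description: str, depends_on: list[str]) -> None:
    """Lead persists a task; teammates poll this directory for work."""
    (TEAM_DIR / "tasks").mkdir(parents=True, exist_ok=True)
    payload = {"id": task_id, "description": description,
               "depends_on": depends_on, "status": "pending"}
    (TEAM_DIR / "tasks" / f"{task_id}.json").write_text(json.dumps(payload))

def send_message(sender: str, recipient: str, body: str) -> None:
    """Direct message: one JSON line appended to the recipient's inbox file."""
    inbox = TEAM_DIR / "inbox" / f"{recipient}.jsonl"
    inbox.parent.mkdir(parents=True, exist_ok=True)
    with inbox.open("a") as f:
        f.write(json.dumps({"from": sender, "body": body}) + "\n")
```

Nothing else crosses the session boundary: no shared conversation history, no shared attention.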

[Figure: architecture of sub-agents versus Agent Teams]

This isolation is the point. Independent context windows mean independent reasoning. When four agents review the same code, they aren’t sharing attention or priming each other’s conclusions. Each one builds its own model of the system from scratch.

The Speed Pattern: Contract-First Parallel Builds

The straightforward use case is parallelizing feature work across layers that don’t share files. You describe the team to the lead, it spawns teammates, and tasks execute in dependency-tracked waves. Independent work runs concurrently. Dependent tasks unblock when prerequisites complete.

A clean setup: three teammates building a “saved search” feature. One writes the OpenAPI contract (the synchronization point). One builds the backend Lambda and DynamoDB resources. One builds the frontend components. The contract teammate finishes first, unblocking the other two to work in parallel with zero file overlap.

Good decomposition has three properties: independent domains, clear file boundaries, and no shared-state conflicts. Same-file edits and tight sequential chains don’t belong in a team. They belong in a single session with sub-agents.
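Those dependency-tracked waves amount to repeated topological layering: at each step, run everything whose prerequisites are done. A minimal sketch, using the saved-search example as input (task names are illustrative):

```python
# Group tasks into waves: each wave contains only tasks whose
# dependencies completed in earlier waves.
def schedule_waves(deps: dict[str, set[str]]) -> list[set[str]]:
    done: set[str] = set()
    waves: list[set[str]] = []
    remaining = dict(deps)
    while remaining:
        # A task is ready when all of its dependencies are done.
        ready = {t for t, d in remaining.items() if d <= done}
        if not ready:
            raise ValueError("dependency cycle")
        waves.append(ready)
        done |= ready
        for t in ready:
            del remaining[t]
    return waves

# The saved-search team: contract first, then backend and frontend in parallel.
waves = schedule_waves({
    "openapi-contract": set(),
    "backend-lambda": {"openapi-contract"},
    "frontend-components": {"openapi-contract"},
})
# waves == [{"openapi-contract"}, {"backend-lambda", "frontend-components"}]
```

The synchronization point (the contract) is the only sequential step; everything after it runs concurrently.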

The Quality Pattern: Reviewers That Push Back

The more interesting pattern trades speed for depth. Instead of splitting a feature across teammates, you point all of them at the same code from different angles.

Four reviewers, each with a specific bias:

  • Security: assume the code is hostile until proven otherwise
  • Reliability: assume every dependency will fail
  • Cost and performance: assume traffic at 100x current load
  • Maintainability: assume a new engineer owns this in six months

The part that makes this work isn’t the role definitions. It’s the adversarial coordination clause. After initial findings, every teammate reads the others’ reports. Any finding rated “critical” by one and “non-issue” by another triggers a debate round via direct message. Teammates are explicitly encouraged to push back.

[Figure: adversarial review, four specialists debating conflicting findings]

Debate ends on consensus or after two rounds with no movement. When there’s no convergence, the lead records both positions. That fallback prevents premature consensus, which is the exact failure mode you’re guarding against.
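That termination rule is compact enough to write down. A sketch, assuming each reviewer's position is reduced to a severity label and the round number stands in for having read the other side's latest argument; every name here is invented for illustration:

```python
from typing import Callable

Severity = str  # e.g. "critical", "moderate", "non-issue"

def debate(assess_a: Callable[[int], Severity],
           assess_b: Callable[[int], Severity],
           max_rounds: int = 2) -> dict:
    """Run up to max_rounds of debate between two reviewers.

    Ends on consensus, or early when neither side moved since the last
    round; with no convergence, the lead records both positions.
    """
    prev = None
    a = b = None
    for rnd in range(1, max_rounds + 1):
        a, b = assess_a(rnd), assess_b(rnd)
        if a == b:
            return {"resolution": "consensus", "severity": a, "rounds": rnd}
        if (a, b) == prev:  # no movement since the previous round
            break
        prev = (a, b)
    return {"resolution": "disagreement", "positions": {"a": a, "b": b}}
```

A reviewer that concedes in round two produces a consensus record; two immovable reviewers produce a recorded disagreement, never a forced merge.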

What Debate Rounds Actually Measure

The most useful metric from an adversarial review is whether debate rounds changed any finding’s severity. A security reviewer flags missing input validation as critical. The reliability reviewer calls it a non-issue because the Lambda’s IAM role already limits blast radius. They argue. The finding moves from critical to moderate with better justification, or it stays critical because the security reasoning held up under challenge.

That severity delta is a clean proxy for whether adversarial configuration produces different outputs compared to harmonious rubber-stamping. If your four-agent review generates the same findings a single agent would have found, the debate clause isn’t earning its tokens.
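Computing that delta is just a diff of severity maps taken before and after debate. A minimal sketch with invented finding IDs:

```python
# Count findings whose severity changed between the initial independent
# reports and the post-debate record. A delta of zero across many runs
# suggests the debate clause isn't earning its tokens.
def severity_delta(initial: dict[str, str], final: dict[str, str]) -> int:
    return sum(1 for fid, sev in initial.items() if final.get(fid, sev) != sev)

initial = {"F1": "critical", "F2": "moderate", "F3": "non-issue"}
final   = {"F1": "moderate", "F2": "moderate", "F3": "non-issue"}
severity_delta(initial, final)  # → 1: only F1 moved after debate
```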

Anthropic’s own stress test puts a number on what the coordination model can handle: 16 agents built a C compiler in Rust across roughly 2,000 sessions at approximately $20,000 in API costs. An extreme case, but it proves the architecture scales to workloads far beyond a typical team review.

The Cost Nobody Wants to Discuss

Agent Teams run at roughly 3 to 4 times the tokens of a single session doing the same work sequentially. Every teammate gets a full context window. Every message between them costs tokens. The lead’s orchestration overhead is real.
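A back-of-envelope model of where that multiplier comes from; the overhead fraction here is an invented illustrative number, not a measured one:

```python
# Every session re-reads the same code in its own context window, and
# inter-agent messages plus the lead's orchestration add a fractional
# overhead on top. The 15% overhead figure is invented for illustration.
def team_multiplier(n_sessions: int, messaging_overhead: float = 0.15) -> float:
    """Token cost of a team relative to one session doing the work alone."""
    return n_sessions * (1 + messaging_overhead)

team_multiplier(3)  # roughly 3.45x: the low end of the observed 3-4x range
```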

[Figure: token cost of a single agent versus a four-agent team]

Practical mitigations exist. Plan in single-session mode first (cheap), review the decomposition, then hand off to the team. Start with three teammates, not eight. Know that session resumption and shutdown are both flaky in the current preview, so don’t design workflows that depend on pausing mid-run.

But the cost question misses the point if you frame it as “4x tokens for the same output.” The bet is that four independent context windows produce different output. Better output. The adversarial debate pattern is the mechanism. The severity delta is the measurement. If the numbers show the output is identical, stop paying for teams. If they show genuine disagreement being surfaced and resolved, the 4x multiplier is buying something a single agent cannot produce at any token budget.

The experiment is live. The measurement is straightforward: run a review with one agent, run the same review with four adversarial agents, compare the findings. The only cost of running that experiment is tokens, and tokens get cheaper every quarter.
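Once findings are normalized to comparable labels, the comparison itself is a set difference; the labels below are invented for illustration:

```python
# Compare the findings from a single-agent review against a four-agent
# adversarial review. Finding labels are invented example values.
def compare_findings(solo: set[str], team: set[str]) -> dict[str, set[str]]:
    """What the team surfaced that the single agent missed, and vice versa."""
    return {"team_only": team - solo,
            "solo_only": solo - team,
            "shared": solo & team}

solo = {"missing-input-validation", "unbounded-retry"}
team = {"missing-input-validation", "unbounded-retry", "iam-role-too-broad"}
compare_findings(solo, team)["team_only"]  # → {"iam-role-too-broad"}
```

An empty `team_only` set over repeated runs is the signal to stop paying the multiplier.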

