sdlcnext.com
claude anthropic opus-4-7 model-behavior ai-transparency

Opus 4.7: New Model, or Old Model Rebranded?

Evidence that Anthropic degraded Opus 4.6 before shipping 4.7, and why the opacity matters more than the benchmarks.


Viewpoint

Opus 4.7 dropped, and every complaint about Opus 4.6 has a directly matching fix. Shallow thinking? Meet X-High effort. Task abandonment? Hand off your hardest work with confidence. Instruction drift? Substantially improved literal adherence. Weak vision? Dramatically improved. The announcement reads like a changelog written against a bug list Anthropic wrote itself.

That is either the textbook way to iterate on a product, or the textbook way to manufacture a release.

Every 4.6 complaint had a matching 4.7 fix

A senior AMD director published an analysis of nearly 7,000 Claude Code sessions. Thinking depth collapsed 73 percent, falling from 2,200 characters of reasoning to 600. File-reading failures rose from 6 percent to 33.7 percent. Users interrupted the model twelve times more often to pull it back from off-the-rails behavior. The word “simplest” appeared almost three times more often in model output, a soft signal that the model was optimizing for least effort.

Users were not hallucinating the regression. They were measuring it.

Bar chart showing thinking depth, file-read failures, and interrupt rate before and after February 9

The silent changes in February

Anthropic did not retrain Opus 4.6. They changed how it was allowed to behave. On February 9th they switched from fixed to adaptive thinking. The model started dynamically allocating reasoning tokens per turn, which in some cases meant zero reasoning tokens. Boris Cherny, who built Claude Code, confirmed the correlation: the turns where the model fabricated output had zero reasoning tokens, and the turns with deep reasoning were correct.

At the same time, the “effort” default quietly dropped to medium. Desktop users could not see the setting in the Opus 4.7 UI. Users on Pro and Max plans were still running on medium effort more than two months later, on April 15th. The parameters moved. The model did not. And nobody told anyone.

A viral third-party benchmark caught the fallout. Bridge Bench accuracy for Opus 4.6 dropped from 83.6 percent to 68.3 percent, pushing the model from second place down to tenth and behind even Sonnet 4.5 on hallucination rate.

What 4.7 actually adds

Not everything about 4.7 is repackaging. The X-High effort tier is new. It is exclusive to 4.7, which suggests there is at least some structural change that supports it. Vision gains appear to be real model-level improvements, not toggles. The SWBench Pro jump is noticeable. Rakuten’s own internal benchmarks show a 3x improvement in production task resolution.

A new /ultra review command spins up a dedicated review session to read through changes and flag issues. The updated tokenizer improves how the model processes text, with the tradeoff that the same input costs 1 to 1.3 times more tokens depending on content type. Biomolecular reasoning roughly doubles. Long-context reasoning, knowledge work, and document reasoning all move in the right direction.
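The tokenizer tradeoff is easy to underestimate until it is expressed as a bill. A back-of-envelope sketch of what a 1.0 to 1.3x token multiplier does to monthly spend; the volume and per-million-token price below are placeholder figures, not Anthropic's published rates:

```python
def monthly_cost(tokens_per_month: int, price_per_mtok: float, multiplier: float = 1.0) -> float:
    """Estimate monthly spend, given a tokenizer inflation multiplier."""
    return tokens_per_month * multiplier / 1_000_000 * price_per_mtok

# Placeholder figures: 500M input tokens per month at $15 per million tokens.
baseline = monthly_cost(500_000_000, 15.0)         # old tokenizer
worst_case = monthly_cost(500_000_000, 15.0, 1.3)  # new tokenizer, 1.3x content
print(f"baseline: ${baseline:,.0f}, worst case: ${worst_case:,.0f}")
# → baseline: $7,500, worst case: $9,750
```

The same prompt volume costs up to 30 percent more simply because it tokenizes differently, which is worth modeling before, not after, the migration.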

4.7 is probably a real model. It is also conveniently the model that fixes exactly what 4.6 stopped doing in February.

The pattern is the problem

The sequence matters more than the diff. The model people paid for got worse. The fix arrived as a new SKU. The defaults changed silently. The UI obscured the change. When the benchmarks came out, the older model looked obsolete and the newer one looked like the obvious upgrade.

Timeline from Opus 4.6 launch through February drift to Opus 4.7 release

Whether this was deliberate throttling, cost optimization, or reasonable adaptive-thinking tuning that went wrong, the result is the same. A worse product shipped at the same price for a month, and a replacement arrived right on cue. “Creating holes just so they can fill them and look like the hero” is how creators in the space described it. The politer framing is that every complaint had a matching fix in 4.7, which is either excellent product work or excellent marketing. Both require the same changelog.

The desktop app is evidence of the same culture

Anthropic also shipped the Claude desktop app. Most of their internal engineers reportedly use it instead of their IDEs. Within one hour of launch, developer Theo logged more than 40 bugs. Voice input typed into every visible text box. Buttons did not work. Layout snapped in unintuitive ways. Most of these were one-prompt fixes, the kind a working QA loop would have caught.

One of the largest AI companies in the world shipped a flagship client with basic layout and input bugs. If their team had really been using it internally for months, and if their most capable internal model is powerful enough to be too dangerous to release, those bugs should not have survived the first review cycle. The speed is real. The QA discipline that should accompany it is not visible.

Capability versus transparency

4.7 may well be the best AI model released to date. It may also be the cure to a disease Anthropic caused themselves. Those two statements are not contradictory. They are the same story told from two directions.

2x2 matrix plotting models by transparency and capability

The benchmark battles miss the point. You cannot evaluate a model whose behavior changes underneath you without announcement. If Opus 4.6 can drop 15 points on a public benchmark because the default effort was silently downgraded, then every performance report carries a date and a configuration asterisk that the vendor controls. The model you evaluated last week is not necessarily the model you are using today.

What teams should take from this

Pin your model version. Log the effort and thinking settings you ship into production. Build a private eval set that runs on every model update, so regressions surface on your data, not a viral leaderboard. Treat “same model, different behavior” as a release event, not a configuration detail.

Opus 4.7 is probably a real upgrade. The real lesson is that the distance between a real upgrade and a convenient one is smaller than the marketing suggests, and the only defense is measurement you own.
