Opus 4.7 dropped, and every complaint about Opus 4.6 has a directly matching fix. Shallow thinking? Meet X-High effort. Task abandonment? Hand off your hardest work with confidence. Instruction drift? Substantially improved literal adherence. Weak vision? Dramatically improved. The announcement reads like a changelog written against a bug list Anthropic wrote itself.
That is either the textbook way to iterate on a product, or the textbook way to manufacture a release.
Every 4.6 complaint had a matching 4.7 fix
A senior AMD director published an analysis of nearly 7,000 Claude Code sessions. Thinking depth collapsed 73 percent, falling from 2,200 characters of reasoning to 600. File-reading failures rose from 6 percent to 33.7 percent. Users interrupted the model twelve times more often to pull it back from off-the-rails behavior. The word “simplest” appeared almost three times more often in model output, a soft signal that the model was optimizing for least effort.
Users were not hallucinating the regression. They were measuring it.

The silent changes in February
Anthropic did not retrain Opus 4.6. They changed how it was allowed to behave. On February 9th they switched from fixed to adaptive thinking. The model started dynamically allocating reasoning tokens per turn, which in some cases meant zero reasoning tokens. Boris Cherny, who built Claude Code, confirmed that the turns where the model fabricated had zero reasoning, and the turns with deep reasoning were correct.
At the same time, the “effort” default quietly dropped to medium. Users on Pro and Max plans were still running on medium effort more than two months later, on April 15th, and desktop users could not see the setting in the Opus 4.7 UI at all. The parameters moved. The model did not. And nobody told anyone.
A viral third-party benchmark caught the fallout. Bridge Bench accuracy for Opus 4.6 dropped from 83.6 percent to 68.3 percent, pushing the model from second place down to tenth, behind Sonnet 4.5 on hallucination.
What 4.7 actually adds
Not everything about 4.7 is repackaging. The X-High effort tier is new. It is exclusive to 4.7, which suggests there is at least some structural change that supports it. Vision gains appear to be real model-level improvements, not toggles. The SWE-Bench Pro jump is noticeable. Rakuten’s own internal benchmarks show a 3x improvement in production task resolution.
A new /ultra review command spins up a dedicated review session to read through changes and flag issues. The updated tokenizer improves how the model processes text, with the tradeoff that the same input costs 1 to 1.3 times more tokens depending on content type. Biomolecular reasoning roughly doubles. Long-context reasoning, knowledge work, and document reasoning all move in the right direction.
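If that multiplier lands on your bill, the math is worth a minute. Here is a back-of-the-envelope sketch; the per-token price and monthly volume are illustrative placeholders, not Anthropic’s published rates.

```python
# Rough cost impact of a tokenizer that emits 1.0-1.3x more tokens for
# the same text. Price and volume below are illustrative placeholders,
# not Anthropic's published rates.

PRICE_PER_TOKEN = 15.00 / 1_000_000  # hypothetical $15 per million input tokens

def monthly_input_cost(tokens_per_month: int, multiplier: float) -> float:
    """Input spend after the tokenizer change inflates token counts."""
    return tokens_per_month * multiplier * PRICE_PER_TOKEN

baseline = monthly_input_cost(500_000_000, 1.0)  # 500M tokens/month today
worst = monthly_input_cost(500_000_000, 1.3)     # same text, new tokenizer

print(f"baseline ${baseline:,.0f}/mo -> worst case ${worst:,.0f}/mo")
# baseline $7,500/mo -> worst case $9,750/mo
```

A 30 percent token inflation is a real line item at volume, which is why a pre-announced tokenizer change matters as much as the capability gains it pays for.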
4.7 is probably a real model. It is also conveniently the model that fixes exactly what 4.6 stopped doing in February.
The pattern is the problem
The sequence matters more than the diff. The model people paid for got worse. The fix arrived as a new SKU. The defaults changed silently. The UI obscured the change. When the benchmarks came out, the older model looked obsolete and the newer one looked like the obvious upgrade.

Whether this was deliberate throttling, cost optimization, or reasonable adaptive-thinking tuning that went wrong, the result is the same. A worse product shipped at the same price for a month, and a replacement arrived right on cue. “Creating holes just so they can fill them and look like the hero” is how creators in the space described it. The politer framing is that every complaint had a matching fix in 4.7, which is either excellent product work or excellent marketing. Both require the same changelog.
The desktop app is evidence of the same culture
Anthropic also shipped the Claude desktop app. Most of their internal engineers reportedly use it instead of their IDEs. Within one hour of launch, developer Theo logged more than 40 bugs. Voice input typed into every visible text box. Buttons did not work. Layout snapped in unintuitive ways. Most of these were one-prompt fixes, the kind a working QA loop would have caught.
One of the largest AI companies in the world shipped a flagship client with basic layout and input bugs. If their team had really been using it internally for months, and if their most capable internal model really is too dangerous to release, those bugs should not have survived the first review cycle. The speed is real. The QA discipline that should accompany it is not visible.
Capability versus transparency
4.7 may well be the best AI model released to date. It may also be the cure for a disease Anthropic caused themselves. Those two statements are not contradictory. They are the same story told from two directions.

The benchmark battles miss the point. You cannot evaluate a model whose behavior changes underneath you without announcement. If Opus 4.6 can drop 15 points on a public benchmark because the default effort was silently downgraded, then every performance report carries a date and a configuration asterisk that the vendor controls. The model you evaluated last week is not necessarily the model you are using today.
What teams should take from this
Pin your model version. Log the effort and thinking settings you ship into production. Build a private eval set that runs on every model update, so regressions surface on your data, not a viral leaderboard. Treat “same model, different behavior” as a release event, not a configuration detail.
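In practice that is a few lines around every production call. A minimal sketch using the Anthropic Python SDK; the model string is a made-up pinned snapshot, and the effort and thinking fields are stand-ins for whatever knobs your plan actually exposes, not confirmed API parameters.

```python
# Pin the exact model snapshot and log every behavior-relevant setting
# next to the request. The model string and the effort/thinking fields
# are illustrative stand-ins, not confirmed API parameters.
import json
import time

import anthropic

client = anthropic.Anthropic()

def logged_completion(prompt: str) -> str:
    config = {
        "model": "claude-opus-4-7-20250415",  # pinned snapshot, never a floating alias
        "max_tokens": 1024,
        "effort": "high",         # hypothetical knob: record it if your plan exposes it
        "thinking_budget": 4096,  # hypothetical knob: same idea
    }
    response = client.messages.create(
        model=config["model"],
        max_tokens=config["max_tokens"],
        messages=[{"role": "user", "content": prompt}],
    )
    # The config ships with every log line, so "same model, different
    # behavior" becomes a diffable event instead of a mystery.
    print(json.dumps({"ts": time.time(), **config}))
    return response.content[0].text
```

The point is not the specific fields. It is that the configuration travels with the output, so when behavior shifts you can prove what changed on your side and what did not.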
Opus 4.7 is probably a real upgrade. The real lesson is that the distance between a real upgrade and a convenient one is smaller than the marketing suggests, and the only defense is measurement you own.
The degrade-then-hero narrative is compelling, and mostly wrong. The facts underneath it are real, but they are being stitched into a conspiracy the evidence does not support. A more honest read starts from a simpler question. Would you have told the same story if Opus 4.7 had not shipped this month?
Adaptive thinking is an engineering choice, not a trap
Switching from fixed to adaptive thinking is exactly what a model team would do after seeing production data. Most user requests do not require 2,200 characters of reasoning. They require a short answer fast. Spending tokens on trivial turns is wasted compute for the provider and wasted latency for the user. Adaptive routing is a reasonable default, and almost every AI product is moving in that direction.
The failure mode was real. The model allocated zero reasoning tokens to turns that needed reasoning, and it fabricated as a result. That is a tuning error, not a scheme. Tuning errors happen, and teams fix them. The fact that Anthropic did not announce the switch loudly is a communication miss, not proof of intent.
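For intuition, here is what adaptive allocation plus the missing guardrail might look like. This is a toy sketch, not Anthropic’s routing logic; every heuristic and name in it is invented.

```python
# Toy sketch of adaptive reasoning-token allocation. Not Anthropic's
# implementation -- it exists to show why a missing floor produces
# zero-reasoning turns that then fabricate.

def estimate_difficulty(prompt: str) -> float:
    """Crude difficulty score in [0, 1] from surface features."""
    signals = ("refactor", "debug", "migrate", "prove", "multi-file")
    hits = sum(word in prompt.lower() for word in signals)
    return min(1.0, 0.2 * hits + min(len(prompt), 4000) / 8000)

def reasoning_budget(prompt: str, max_tokens: int = 4096, floor: int = 256) -> int:
    """Scale reasoning tokens with estimated difficulty, never below a floor.

    Setting floor=0 reproduces the February failure mode: any turn the
    router misjudges as trivial gets zero reasoning, and the model
    answers from pattern-matching alone.
    """
    budget = int(estimate_difficulty(prompt) * max_tokens)
    return max(budget, floor)
```

In this framing the fix is one line, which is consistent with the tuning-error reading: cheap to introduce, cheap to correct, and nothing about it requires intent.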
The viral benchmark moment overstates the drop
Bridge Bench is one third-party benchmark run by people with their own priors. The 83.6 to 68.3 percent drop is real, but every benchmark has a confidence interval, and hallucination benchmarks are especially sensitive to prompt format, task distribution, and evaluator choices. A fall from second place to tenth makes for a dramatic headline, not a rigorous comparison across independent runs.
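That distinction can be made concrete with a quick two-proportion check. Bridge Bench’s actual task count is not cited here, so the n=300 per run below is an illustrative assumption.

```python
# Is 83.6% -> 68.3% outside sampling noise? A two-proportion z check.
# Bridge Bench's real task count is not cited here, so n=300 per run
# is an illustrative assumption.
from math import sqrt

def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> float:
    """z statistic for the gap between two accuracy measurements."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

print(f"z = {two_proportion_z(0.836, 0.683, 300, 300):.1f}")  # z = 4.4
```

At that assumed n, the drop itself clears sampling noise comfortably, which matches the concession above: the regression was real. What the math does not support is treating the rank shift as precise, since a few points of noise can reorder the closely bunched middle of a leaderboard.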
Anthropic’s own system card for Opus 4.6 never advertised 83.6 percent on a specific third-party test suite. The number was a snapshot from a community leaderboard. Treating it as the canonical capability metric is the same mistake every benchmark critic has spent the last three years pointing out. Benchmarks are signals, not contracts.
“Matching fixes” is what iteration looks like
The argument that every 4.6 complaint has a matching 4.7 fix is presented as suspicious. Strip the framing and what remains is a product team that listened to users and prioritized the top feedback categories. X-High effort is not a conspiracy. It is an obvious lever after learning that medium was a bad default. The /ultra review command exists because users asked for it. The tokenizer update was pre-announced in Anthropic’s research blog.
If the release notes did not match the complaints, the post-mortem would be even harsher. “Anthropic ignores user feedback and ships a new model that does not address any known regressions” is a worse story than the one currently being told. Product iteration aligned with user pain looks like responsiveness in most industries. Recasting it as stagecraft requires ignoring the base rate of how product releases actually work.

Desktop app bugs are a separate story
Forty bugs in an hour sounds damning, and also matches the launch day of almost every consumer app in recent memory. Slack, Linear, Notion, and Vercel have all shipped 1.0 clients that looked worse than their beta. Most of the bugs Theo found are layout glitches and input-handling edge cases, not behavioral regressions in the underlying model. They have nothing to do with whether Opus 4.6 was intentionally degraded.
Conflating UI launch bugs with model integrity is a category error. The desktop app is an Electron client built by a different team on a different timeline. Its bugs reveal application engineering maturity, not model-team ethics. Treating both as symptoms of the same culture is the kind of pattern matching that feels satisfying and explains nothing.
Correlation is not motive
The strongest version of the skeptic case is simple. Anthropic changed model configuration in February. Users noticed degraded behavior in March. Anthropic shipped a new version in April that addressed those behaviors. Every step in that sequence has a mundane explanation that does not require deliberate degradation.

Cost pressure from Max-plan users burning through daily token budgets would push a provider toward adaptive thinking. So would research into whether over-thinking hurt response quality on simple tasks. So would preparation for a larger infrastructure change that 4.7 eventually normalized. Any of those explanations fits the observed data. The hero-villain narrative is not required.
Version pinning is already standard practice
The dominant read argues that teams without private eval sets are running blind. That is true in a trivial sense. Any team deploying an LLM in production without an eval set and model-version pinning failed basic engineering hygiene years ago. The advice is correct and not new. It applies to OpenAI, Google, Meta, and every open-source model with a changing default quantization. It is not a specifically anti-Anthropic posture.
Calling for eval discipline only after the Opus 4.7 release is reactive. The teams that had this right in January are unaffected. The teams that did not are learning what has been true for three years.
The real takeaway is smaller

Opus 4.7 looks like a real upgrade. Adaptive thinking with better routing is probably net positive across most workloads. The X-High effort tier gives power users a lever they asked for. Improved vision and biomolecular reasoning are structural gains, not toggles. Most users will find the new version a better product.
The valid residual complaint is about communication. Anthropic should have announced the February 9th adaptive thinking change in a changelog, not left it for users to discover. That is a documentation bug, not a business model. It can be fixed next release. The story is not that a major AI lab is manipulating its customers. The story is that AI providers are still not shipping changelogs at the maturity level their enterprise customers expect. That is worth pushing on, but it is a smaller claim than the dominant narrative wants it to be.
Anthropic ran a play. Opus 4.6 got worse in February. Opus 4.7 fixed it in April. The sequence reads like a release strategy, not a coincidence, and any team still assuming unversioned model behavior is stable is already paying for the lesson.
The degradation was real and it was measurable
Nearly 7,000 Claude Code sessions, studied by a senior AMD director, showed thinking depth collapsing 73 percent. File-reading failures went from 6 percent to 33.7 percent. Interrupt rate climbed 12x. The word “simplest” appeared almost three times more often in outputs, a canary for effort collapse. Bridge Bench accuracy fell from 83.6 percent to 68.3 percent, dropping Opus 4.6 from second place to tenth. This is not anecdote. This is telemetry.

Users noticed because users measure. The people burning through $200-a-month Max plans in an hour were not imagining things. They had receipts.
The mechanism was never a model change
Opus 4.6 was not retrained. It was reconfigured. On February 9th, Anthropic switched fixed thinking to adaptive thinking, which means the model now decides whether a request is worth any reasoning tokens at all. The creator of Claude Code confirmed that the turns where the model hallucinated had zero reasoning. The effort default silently dropped to medium. Users on Pro and Max plans were still defaulting to medium on April 15th, more than two months later, and the Opus 4.7 UI did not expose the setting at all.
The configuration moved. The model did not. The experience was a different model. The name stayed the same.
Every 4.6 complaint has a matching 4.7 fix
The release notes read like a bug report Anthropic authored themselves. Shallow thinking, fixed by X-High effort, exclusive to 4.7. Task abandonment, fixed by “hand off your hardest work with confidence.” Instruction drift, fixed by substantially improved literal adherence. Hallucinations, fixed by the ability to catch logical faults during planning. Weak vision, fixed by dramatic vision improvements. Safety concerns after the Opus 4.6 blackmail incident, addressed under a new safety and alignment section.
That is a repackaged SKU. Opus 4.7 may also be a genuinely better model. Both can be true. Both are how the play works.
The desktop app tells you everything about QA discipline
Anthropic shipped a desktop app reportedly used by most of their engineers internally. Within an hour of launch, developer Theo found more than 40 bugs. Voice input went into every visible text box simultaneously. Buttons did not work. Layouts snapped unexpectedly. These are not edge cases. They are first-click issues a single manual pass would have caught.
If the flagship client ships this broken, the internal eval discipline on the model layer cannot be assumed to be tighter. The same culture that ships 40-bug UIs ships silent default changes. The speed is real. The review gates are not.
The pattern repeats because the market rewards it

This will happen again, because there is no structural force against it. Customers cannot diff model behavior between versions. Benchmarks lag. Developer tools do not expose the effort parameter. Teams without private eval sets are running blind. Every quarter the vendor makes a configuration change, measured capability drops, and a new release arrives with the fix. The cost to the vendor is zero. The cost to customers is measured in wasted tokens, missed deadlines, and cancelled subscriptions.
The only thing that stops this cycle is customers who measure. The market has not forced that yet. It will.
The right axis is transparency, not capability

Benchmark wars are distracting. A model with publicly documented versioning, pinned effort parameters, and changelog-level communication about behavioral defaults is more valuable than a 5-point benchmark win over a model with opaque configuration. When you buy Opus 4.7, you are also buying the policy that says Anthropic can silently change how it behaves next Tuesday. That policy is a liability you are not pricing in.
Teams that pin versions have leverage. Teams that do not are exposed to the next February 9th.
What engineering leaders should do this week
Pin the exact model version in every production call. Log the effort and thinking settings alongside every inference. Stand up a private eval set that runs on every model update, with pass-fail thresholds tied to your real business tasks. When a vendor ships a configuration change, you want to see the regression on your data before a viral leaderboard catches it.
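A minimal version of that gate fits in one file. The tasks, checks, and 0.9 threshold below are placeholders for your real business tasks; `run_model` stands in for your pinned production call.

```python
# Sketch of a private regression gate: run your own tasks on every model
# update and block the version bump when accuracy dips. Tasks, checks,
# and the threshold are placeholders for your real workload.

TASKS = [
    {"prompt": "Summarize this incident report: ...",
     "check": lambda out: "root cause" in out.lower()},
    {"prompt": "Extract the total due from this invoice: ...",
     "check": lambda out: "$" in out},
]

def run_model(prompt: str) -> str:
    # Replace with your pinned, config-logged production call.
    return "stub response: root cause identified, total due $1,024"

def regression_gate(threshold: float = 0.9) -> bool:
    """True if the candidate model clears your pass-fail bar."""
    passed = sum(bool(task["check"](run_model(task["prompt"]))) for task in TASKS)
    score = passed / len(TASKS)
    print(f"private eval: {passed}/{len(TASKS)} passed ({score:.0%})")
    return score >= threshold
```

Run it on every vendor update, gate the version bump on the result, and the next silent default change shows up in your logs before it shows up on a leaderboard.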
If you are on a Max plan and burning through session limits faster than expected, that is a signal, not a usage pattern. Investigate it. If your model started abandoning tasks mid-run, that is a signal. Investigate it. The next version of this story happens with Sonnet, or with a competitor, or with a model that does not exist yet. The defense is the same either way.
4.7 is probably a better model. Anthropic probably also degraded 4.6 on purpose, or by accident, or through a tuning choice that went wrong. The motive does not matter. The control does. Buy the capability. Build the measurement. And stop assuming the model you evaluated is the model you are running.