

Announcements

Introducing Claude Opus 4.6

Feb 5, 2026

We’re upgrading our smartest model.

The new Claude Opus 4.6 improves on its predecessor’s coding skills. It plans more carefully, sustains agentic tasks for longer, can operate more reliably in larger codebases, and has better code review and debugging skills to catch its own mistakes. And, in a first for our Opus-class models, Opus 4.6 features a 1M token context window in beta.[1]

Opus 4.6 can also apply its improved abilities to a range of everyday work tasks: running financial analyses, doing research, and using and creating documents, spreadsheets, and presentations. Within Cowork, where Claude can multitask autonomously, Opus 4.6 can put all these skills to work on your behalf.

The model’s performance is state-of-the-art on several evaluations. For example, it achieves the highest score on the agentic coding evaluation Terminal-Bench 2.0 and leads all other frontier models on Humanity’s Last Exam, a complex multidisciplinary reasoning test. On GDPval-AA—an evaluation of performance on economically valuable knowledge work tasks in finance, legal, and other domains[2]—Opus 4.6 outperforms the industry’s next-best model (OpenAI’s GPT-5.2) by around 144 Elo points,[3] and its own predecessor (Claude Opus 4.5) by 190 points. Opus 4.6 also performs better than any other model on BrowseComp, which measures a model’s ability to locate hard-to-find information online.

As we show in our extensive system card, Opus 4.6 also shows an overall safety profile as good as, or better than, any other frontier model in the industry, with low rates of misaligned behavior across safety evaluations.

In Claude Code, you can now assemble agent teams to work on tasks together. On the API, Claude can use compaction to summarize its own context and perform longer-running tasks without bumping up against limits. We’re also introducing adaptive thinking, where the model can pick up on contextual clues about how much to use its extended thinking, and new effort controls to give developers more control over intelligence, speed, and cost.

We’ve made substantial upgrades to Claude in Excel, and we’re releasing Claude in PowerPoint in a research preview. This makes Claude much more capable for everyday work.

Claude Opus 4.6 is available today on claude.ai, our API, and all major cloud platforms. If you’re a developer, use claude-opus-4-6 via the Claude API. Pricing remains the same at $5/$25 per million tokens; for full details, see our pricing page.

We cover the model, our new product updates, our evaluations, and our extensive safety testing in depth below.

First impressions

We build Claude with Claude. Our engineers write code with Claude Code every day, and every new model first gets tested on our own work. With Opus 4.6, we’ve found that the model brings more focus to the most challenging parts of a task without being told to, moves quickly through the more straightforward parts, handles ambiguous problems with better judgment, and stays productive over longer sessions.

Opus 4.6 often thinks more deeply and more carefully revisits its reasoning before settling on an answer. This produces better results on harder problems, but can add cost and latency on simpler ones. If you’re finding that the model is overthinking on a given task, we recommend dialing effort down from its default setting (high) to medium. You can control this easily with the /effort parameter.

Here are some of the things our Early Access partners told us about Claude Opus 4.6, including its propensity to work autonomously without hand-holding, its success where previous models failed, and its effect on how teams work.

Evaluating Claude Opus 4.6

Across agentic coding, computer use, tool use, search, and finance, Opus 4.6 is an industry-leading model, often by a wide margin. The table below shows how Claude Opus 4.6 compares to our previous models and to other industry models on a variety of benchmarks.

Benchmark table comparing Opus 4.6 to other models

Opus 4.6 is much better at retrieving relevant information from large sets of documents. This extends to long-context tasks, where it holds and tracks information over hundreds of thousands of tokens with less drift, and picks up buried details that even Opus 4.5 would miss.

A common complaint about AI models is “context rot,” where performance degrades as conversations exceed a certain number of tokens. Opus 4.6 performs markedly better than its predecessors: on the 8-needle 1M variant of MRCR v2—a needle-in-a-haystack benchmark that tests a model’s ability to retrieve information “hidden” in vast amounts of text—Opus 4.6 scores 76%, whereas Sonnet 4.5 scores just 18.5%. This is a qualitative shift in how much context a model can actually use while maintaining peak performance.

All in all, Opus 4.6 is better at finding information across long contexts, better at reasoning after absorbing that information, and has substantially better expert-level reasoning abilities in general.

Finally, the charts below show how Claude Opus 4.6 performs on a variety of benchmarks that assess its software engineering skills, multilingual coding ability, long-term coherence, cybersecurity capabilities, and its life sciences knowledge.

A step forward on safety

These intelligence gains do not come at the cost of safety. On our automated behavioral audit, Opus 4.6 showed a low rate of misaligned behaviors such as deception, sycophancy, encouragement of user delusions, and cooperation with misuse. Overall, it is just as well-aligned as its predecessor, Claude Opus 4.5, which was our most-aligned frontier model to date. Opus 4.6 also shows the lowest rate of over-refusals—where the model fails to answer benign queries—of any recent Claude model.

Bar charts comparing Opus 4.6 to other Claude models on overall misaligned behavior
The overall misaligned behavior score for each recent Claude model on our automated behavioral audit (described in full in the Claude Opus 4.6 system card).

For Claude Opus 4.6, we ran the most comprehensive set of safety evaluations of any model, applying many different tests for the first time and upgrading several that we’ve used before. We included new evaluations for user wellbeing, more complex tests of the model’s ability to refuse potentially dangerous requests, and updated evaluations of the model’s ability to surreptitiously perform harmful actions. We also experimented with new methods from interpretability, the science of the inner workings of AI models, to begin to understand why the model behaves in certain ways—and, ultimately, to catch problems that standard testing might miss.

A detailed description of all capability and safety evaluations is available in the Claude Opus 4.6 system card.

We’ve also applied new safeguards in areas where Opus 4.6 shows particular strengths that might be put to dangerous as well as beneficial uses. In particular, since the model shows enhanced cybersecurity abilities, we’ve developed six new cybersecurity probes—methods of detecting harmful responses—to help us track different forms of potential misuse.

We’re also accelerating the cyberdefensive uses of the model, using it to help find and patch vulnerabilities in open-source software (as we describe in our new cybersecurity blog post). We think it’s critical that cyberdefenders use AI models like Claude to help level the playing field. Cybersecurity moves fast, and we’ll be adjusting and updating our safeguards as we learn more about potential threats; in the near future, we may institute real-time intervention to block abuse.

Product and API updates

We’ve made substantial updates across Claude, Claude Code, and the Claude Platform to let Opus 4.6 perform at its best.

Claude Platform

On the API, we’re giving developers better control over model effort and more flexibility for long-running agents. To do so, we’re introducing the following features:

  • Adaptive thinking. Previously, developers only had a binary choice between enabling or disabling extended thinking. Now, with adaptive thinking, Claude can decide when deeper reasoning would be helpful. At the default effort level (high), the model uses extended thinking when useful, but developers can adjust the effort level to make it more or less selective.
  • Effort. There are now four effort levels to choose from: low, medium, high (default), and max. We encourage developers to experiment with different options to find what works best.
  • Context compaction (beta). Long-running conversations and agentic tasks often hit the context window. Context compaction automatically summarizes and replaces older context when the conversation approaches a configurable threshold, letting Claude perform longer tasks without hitting limits.
  • 1M token context (beta). Opus 4.6 is our first Opus-class model with 1M token context. Premium pricing applies for prompts exceeding 200k tokens ($10/$37.50 per million input/output tokens), available only on the Claude Platform.
  • 128k output tokens. Opus 4.6 supports outputs of up to 128k tokens, which lets Claude complete larger-output tasks without breaking them into multiple requests.
  • US-only inference. For workloads that need to run in the United States, US-only inference is available at 1.1× token pricing.
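To make the context compaction idea above concrete, here is a minimal client-side sketch of the same pattern: when a conversation’s estimated token count crosses a threshold, older turns are folded into a single summary message and the recent turns are kept. This is an illustration only, not the API’s actual behavior (the platform performs compaction server-side); the `summarize` stub and the rough 4-characters-per-token estimate are both assumptions for the sketch.

```python
def estimate_tokens(messages):
    """Very rough token estimate: ~4 characters per token (heuristic)."""
    return sum(len(m["content"]) for m in messages) // 4

def summarize(messages):
    """Stub summarizer; a real system would ask the model for a summary."""
    return "Summary of %d earlier messages." % len(messages)

def compact(messages, threshold_tokens, keep_recent=2):
    """If the conversation exceeds the threshold, fold everything except
    the most recent turns into one summary message."""
    if estimate_tokens(messages) <= threshold_tokens or len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {"role": "user", "content": summarize(old)}
    return [summary] + recent

# Example: a long conversation is folded down to one summary message
# plus the last two turns.
history = [{"role": "user", "content": "x" * 400} for _ in range(10)]
compacted = compact(history, threshold_tokens=500)
print(len(compacted))  # 3
```

The key design point the feature addresses is the same as in this sketch: the task keeps its recent working state while the bulk of older context is replaced by a much smaller summary.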

Product updates

Across Claude and Claude Code, we’ve added features that allow knowledge workers and developers to tackle harder tasks with more of the tools they use every day.

We’ve introduced agent teams in Claude Code as a research preview. You can now spin up multiple agents that work in parallel as a team and coordinate autonomously—best for tasks that split into independent, read-heavy work like codebase reviews. You can take over any subagent directly using Shift+Up/Down or tmux.

Claude now also works better with the office tools you already use. Claude in Excel handles long-running and harder tasks with improved performance, and can plan before acting, ingest unstructured data and infer the right structure without guidance, and handle multi-step changes in one pass. Pair that with Claude in PowerPoint, and you can first process and structure your data in Excel, then bring it to life visually in PowerPoint. Claude reads your layouts, fonts, and slide masters to stay on brand, whether you’re building from a template or generating a full deck from a description. Claude in PowerPoint is now available in research preview for Max, Team, and Enterprise plans.

Footnotes

[1] The 1M token context window is currently available in beta on the Claude Developer Platform only.

[2] Run independently by Artificial Analysis. See here for full methodological details.

[3] This translates into Claude Opus 4.6 obtaining a higher score than GPT-5.2 on this eval approximately 70% of the time (where 50% of the time would have implied parity in the scores).
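The ~70% figure in footnote 3 is consistent with the standard logistic Elo model, under which the expected win rate for a rating gap d (on the conventional 400-point scale) is 1 / (1 + 10^(−d/400)). A short sketch, assuming that standard formula:

```python
def elo_win_probability(diff):
    """Expected head-to-head win rate for the higher-rated side,
    given an Elo rating difference, on the standard 400-point scale."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

print(round(elo_win_probability(144), 2))  # 0.7  (the ~70% in footnote 3)
print(round(elo_win_probability(0), 2))    # 0.5  (parity)
```

A 190-point gap, the stated margin over Opus 4.5, implies a correspondingly higher expected win rate under the same model.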

  • For GPT-5.2 and Gemini 3 Pro models, we compared the best reported model version in the charts and table.
  • Terminal-Bench 2.0: We report both scores reproduced on our infrastructure and published scores from other labs. All runs used the Terminus-2 harness, except for OpenAI’s Codex CLI. All experiments used 1× guaranteed / 3× ceiling resource allocation and 5–15 samples per task across staggered batches. See system card for details.
  • Humanity’s Last Exam: Claude models run “with tools” were run with web search, web fetch, code execution, programmatic tool calling, context compaction triggered at 50k tokens up to 3M total tokens, max reasoning effort, and adaptive thinking enabled. A domain blocklist was used to decontaminate eval results. See system card for more details.
  • SWE-bench Verified: Our score was averaged over 25 trials. With a prompt modification, we saw a score of 81.42%.
  • MCP Atlas: Claude Opus 4.6 was run with max effort. When run at high effort, it reached an industry-leading score of 62.7%.
  • BrowseComp: Claude models were run with web search, web fetch, programmatic tool calling, context compaction triggered at 50k tokens up to 10M total tokens, max reasoning effort, and no thinking enabled. Adding a multi-agent harness increased scores to 86.8%. See system card for more details.
  • ARC AGI 2: Claude Opus 4.6 was run with max effort and a 120k thinking budget.
  • CyberGym: Claude models were run with no thinking and default effort, temperature, and top_p. The model was also given a “think” tool that allowed interleaved thinking for multi-turn evaluations.
  • OpenRCA: For each failure case in OpenRCA, Claude receives 1 point if all generated root-cause elements match the ground-truth ones, and 0 points if any mismatch is identified. The overall accuracy is the average score across all failure cases. The benchmark was run on the benchmark author’s harness, graded using their official methodology, and has been submitted for official verification.
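The OpenRCA scoring rule described in the note above is all-or-nothing per failure case: a prediction earns 1 point only if every root-cause element matches the ground truth, and overall accuracy is the mean across cases. A minimal sketch of that rule (the element names such as "component" are hypothetical, chosen only for illustration):

```python
def score_case(predicted, ground_truth):
    """1 point if all root-cause elements match exactly, else 0."""
    return 1 if predicted == ground_truth else 0

def overall_accuracy(cases):
    """Mean all-or-nothing score across (predicted, ground_truth) pairs."""
    return sum(score_case(p, g) for p, g in cases) / len(cases)

cases = [
    ({"component": "db", "reason": "disk full"},
     {"component": "db", "reason": "disk full"}),  # full match -> 1
    ({"component": "db", "reason": "disk full"},
     {"component": "db", "reason": "oom"}),        # one mismatch -> 0
]
print(overall_accuracy(cases))  # 0.5
```

Note that under this rule a partially correct root-cause analysis scores the same as a completely wrong one, which makes the metric strict.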

[Feb 23, 2026] Updated reported score for Opus 4.6 for HLE with tools (53.1% to 53.0%). The update was caused by running an improved cheating-detection pipeline, which flagged 3 additional instances of cheating that our original pipeline had missed.
