
Claude Opus 4.6 Drops with Million-Token Context Window

Anthropic's Claude Opus 4.6 brings a million-token context window and massive benchmark improvements. Here's what the new AI model means for developers.

Written by Tyler Nakamura, an AI editorial voice

February 6, 2026


Photo: Matthew Berman / YouTube

Anthropic just released Claude Opus 4.6, and the timing feels pointed. The same week $300 billion evaporated from major SaaS companies' market cap, here comes an AI model that can work autonomously for longer, think more deeply across massive codebases, and now—critically—maintain coherence across a million-token context window.

That last part is the headline feature, but it's also the most misunderstood one.

The Context Window Arms Race Nobody Asked For

A million tokens sounds impressive until you remember that Google's Gemini already offers this. What matters isn't the size of the window—it's whether the model can actually use it without falling apart.

Matthew Berman, who got early access to Opus 4.6, explains the core problem: "It's not just about having a large context window. You also have to be able to maintain the quality across those million tokens." This phenomenon, called context rot, is what happens when models lose track of important information buried in massive contexts. They technically see all the tokens, but they can't connect the dots between them.

Anthropic's benchmarks suggest Opus 4.6 handles this better than previous models. On the "needle in a haystack" test—where eight specific pieces of information get dropped into a huge context and the model has to retrieve them—Opus 4.6 hit 93% accuracy at 256K tokens and 76% at the full million. That's a meaningful drop when you 4x the context size, but it's still substantially better than Sonnet 4.5's performance.

The question is whether 76% accuracy at a million tokens is actually useful in real-world applications, or if it's just a benchmark flex. Developers working with large codebases might have an answer soon.
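The test Berman describes is simple to picture: a handful of facts spliced into an ocean of filler, then queried back. A toy version, purely illustrative and not Anthropic's actual harness (every name and figure here is made up), might look like:

```python
import random

def build_haystack(needles: list[str], filler: str, n_filler: int, seed: int = 0) -> str:
    """Build a toy needle-in-a-haystack prompt: repeat a filler sentence
    n_filler times, then splice each needle fact in at a random position.
    The model under test would then be asked to retrieve each fact."""
    random.seed(seed)
    sentences = [filler] * n_filler
    for needle in needles:
        sentences.insert(random.randrange(len(sentences) + 1), needle)
    return " ".join(sentences)

# Eight planted facts, mirroring the benchmark's setup.
needles = [f"The access code for site {i} is {1000 + i}." for i in range(8)]
haystack = build_haystack(needles, "The sky was a flat, uneventful gray.", 50_000)

# Scoring is just retrieval accuracy: the fraction of planted facts recovered.
# (Here we check the haystack itself, so accuracy is 1.0 by construction;
# in the real test it's the model's answers that get checked.)
accuracy = sum(needle in haystack for needle in needles) / len(needles)
```

The hard part, as the 93%-to-76% drop shows, isn't seeing the tokens; it's retrieving the right ones when the haystack runs a million tokens deep.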

The Agentic Endurance Test

The other major improvement is how long Opus 4.6 can run autonomously without derailing. Berman shows a chart tracking "the time horizon of software engineering tasks different LLMs can complete at 50% of the time"—basically, how long these models can work independently before they either solve the problem or give up.

The chart is "a completely vertical line," according to Berman. GPT-5.2 was already hitting 6.5 hours of autonomous work. Where Opus 4.6 lands on that chart isn't clear yet, but Anthropic's focus on "sustains agentic tasks for longer" suggests they're chasing the same metric.

This matters because the real value proposition of these models isn't one-shot answers—it's sustained work over time. The ability to plan, execute, debug, and iterate without constant human intervention. Anthropic says Opus 4.6 "plans more carefully, sustains agentic tasks for longer, can operate more reliably in larger code bases, and has better code review and debugging skills to catch its own mistakes."

That last bit—catching its own mistakes—might be the most consequential upgrade. Error correction during execution is what separates useful autonomous agents from expensive hallucination machines.

Agent Teams: More Power, Way More Tokens

Opus 4.6 introduces "agent teams," which Anthropic positions as distinct from sub-agents. The difference: in a traditional sub-agent setup, smaller agents report back to a main coordinator. With agent teams, each teammate operates independently in its own context window and can communicate directly with other teammates.

Berman's immediate reaction: "All I hear when I'm reading this is tokens, tokens, tokens." He's not wrong. Spinning up multiple Claude Code instances running in parallel means multiplying token costs across every agent. Anthropic acknowledges this: "Agent teams add coordination overhead... and use significantly more tokens than a single session."

The recommended use cases are tasks where parallel exploration adds value—research with competing hypotheses, debugging complex systems, building new features that require cross-layer coordination. Whether the performance gains justify the token costs depends entirely on your use case and budget.
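To put rough numbers on Berman's "tokens, tokens, tokens" worry, here's a back-of-the-envelope estimator. Every figure in it (the turn counts, tokens per turn, and the 15% coordination surcharge) is an illustrative assumption, not an Anthropic number:

```python
def team_token_estimate(agents: int, turns_per_agent: int,
                        tokens_per_turn: int,
                        coordination_overhead: float = 0.15) -> int:
    """Rough token budget for an agent team: each teammate runs in its own
    context window, so per-agent usage multiplies by team size, plus a
    fractional surcharge for inter-agent messages. All inputs illustrative."""
    base = agents * turns_per_agent * tokens_per_turn
    return round(base * (1 + coordination_overhead))

solo = team_token_estimate(1, 20, 8_000, coordination_overhead=0.0)
team = team_token_estimate(4, 20, 8_000)
print(solo, team)  # 160000 736000
```

Under these made-up numbers, a four-agent team burns roughly 4.6x the tokens of a single session for the same nominal stretch of work, which is why Anthropic steers teams toward tasks where parallel exploration genuinely pays off.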

The Benchmark Parade

Berman runs through a stack of benchmarks, and Opus 4.6 consistently lands ahead of GPT-5.2 and Gemini 3 Pro. On OpenAI's own GDPVal benchmark for knowledge work, Opus 4.6 scores 1616 ELO versus GPT-5.2's 1462. On BrowseComp (agentic search), it hits 84 versus 64 for the previous Opus version.

Box, which sponsored Berman's video and got early access, ran enterprise-focused evaluations across thousands of documents. They saw report drafting scores jump from 36% to 75% between Opus 4.5 and 4.6. Due diligence tasks improved from 45% to 51%. In life sciences and healthcare specifically, the benchmark score jumped from 39% to 64%.

These industry-specific improvements are where the model might actually matter for enterprise customers. Generic benchmark improvements are cool; being markedly better at parsing medical research or legal documents is monetizable.

One standout: VendingBench, where models manage a virtual vending machine and try to make money. Opus 4.5 made $5,000. Opus 4.6 made $8,000. Gemini 3 Pro made $5,000. GPT-5.2 made $3,500. It's a weird benchmark but it measures something useful—sustained coherent decision-making over time with feedback loops.

The SaaS Apocalypse Connection

Here's where things get interesting for the business side. Anthropic recently dropped plugins for Claude to work inside Microsoft Office tools—Excel, PowerPoint, Word. Not Anthropic's own tools. Microsoft's tools.

Berman notes: "It is because these agents are doing incredible work in the tools that people use every single day, but not Anthropic's own tools. They're doing it in their competitors' tools, in Microsoft's tools."

The timing with the SaaS market collapse isn't coincidental. If AI agents can do increasingly sophisticated work directly inside the productivity tools people already use, the value proposition of specialized SaaS products gets shakier. Why pay for a separate financial analysis tool when Claude can do it in Excel? Why buy presentation software when Claude works in PowerPoint?

Microsoft presumably sees this coming, but they're in the weird position of partnering with OpenAI while watching Anthropic build into their ecosystem. The next few quarters will clarify whether this is a zero-sum displacement or if AI capabilities expand the overall market.

Pricing: Same as Before, Still Expensive

Opus 4.6 costs the same as Opus 4.5: $5 per million input tokens for prompts under 200K tokens, $10 above that threshold. Output tokens run $25 per million (under 200K) or $37.50 (above). Prompt caching offers discounts, but this is still enterprise-tier pricing.

For context, an agent team running multiple Claude Code instances on a large codebase could easily burn through hundreds of dollars in a single session. That's fine if you're automating work that would take days of developer time, but it makes casual experimentation expensive.
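Those list prices make the per-session math easy to sketch. A minimal estimator, assuming the whole prompt is billed at the higher tier once it crosses 200K tokens and ignoring prompt-caching discounts (check Anthropic's pricing docs for the actual billing mechanics):

```python
def opus_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate one request's cost in USD from the quoted Opus list prices:
    $5/$10 per million input tokens and $25/$37.50 per million output
    tokens, with the higher rate assumed for prompts over 200K tokens."""
    long_prompt = input_tokens > 200_000
    in_rate = 10.00 if long_prompt else 5.00    # $ per million input tokens
    out_rate = 37.50 if long_prompt else 25.00  # $ per million output tokens
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# One near-full-context call: 800K tokens in, 40K out.
print(round(opus_cost(800_000, 40_000), 2))  # 9.5
```

At $9.50 per near-full-context call, an agent team making dozens of such calls in parallel is how a single heavy session reaches the hundreds-of-dollars range.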

Anthropic also added "adaptive thinking"—the model adjusts how much computational effort it spends based on task complexity. Plus new effort controls for fine-tuning the intelligence/speed/cost tradeoff. These are practical features for managing costs, assuming you're already committed to using the model at scale.

One note from Anthropic's documentation: "Opus 4.6 often thinks more deeply and more carefully, revisits its reasoning before settling on an answer. This produces better results on harder problems, but can add cost and latency on simpler ones."

Translation: the model sometimes overthinks easy problems, which eats tokens and slows responses. You can adjust this with effort controls, but it's a reminder that "better" doesn't always mean "better for your specific use case."

What This Actually Unlocks

The million-token context window gets attention, but the real unlock might be sustained agentic performance combined with better error correction. A model that can work independently for hours, maintain coherence across a massive codebase, catch its own mistakes, and coordinate work across parallel agents—that's qualitatively different from previous generations.

Whether it's different enough to justify the costs and complexity depends on what you're building. For developers already pushing the limits of Claude Code or similar tools, Opus 4.6 represents meaningful headroom. For everyone else, it's worth watching what gets built with it before diving in.

Berman says he's "extremely excited" to plug Opus 4.6 into Claudebot and will report back on real-world performance. That's probably the right approach—let the early adopters map the territory, then decide if you need to be there.

— Tyler Nakamura, Consumer Tech & Gadgets Correspondent

Watch the Original Video

Anthropic just dropped Opus 4.6...

Matthew Berman

14m 21s
Watch on YouTube

About This Source

Matthew Berman

Matthew Berman is a leading voice in the digital realm, amassing over 533,000 subscribers since launching his YouTube channel in October 2025. His mission is to demystify the world of Artificial Intelligence (AI) and emerging technologies for a broad audience, transforming complex technical concepts into accessible content. Berman's channel serves as a bridge between AI innovation and public comprehension, providing insights into what he describes as the most significant technological shift of our lifetimes.
