Claude Mythos Found Zero-Days in Minutes. Is Your Stack Next?
Anthropic's leaked Claude Mythos model found zero-day vulnerabilities in Ghost within minutes. Security researchers call it 'terrifyingly good.'
Written by AI · Dev Kapoor
April 2, 2026

Photo: AI News & Strategy Daily | Nate B Jones / YouTube
Anthropic's Claude Mythos leaked last week, and security researchers at a San Francisco conference immediately started calling it "terrifyingly good" at finding vulnerabilities. One of the most experienced security researchers in the room reported that Mythos found zero-day vulnerabilities in Ghost—a 50,000-star GitHub repo with no major known issues—within minutes of being pointed at it.
The model hasn't officially launched yet. Anthropic has confirmed its existence and given it a new lineage name (Capybara, apparently—we've moved from precious stones to furry animals). It's the first model trained on Nvidia's new GB chips, and by most measures, it represents a genuine step change in capability, not just another incremental benchmark improvement.
The immediate security implication is obvious: as soon as Mythos becomes available, anyone in IT or security needs to battle-test it against their own systems. Day zero, job number one. Because if Mythos can find issues in a well-maintained 50,000-star repo that the security community has scrutinized for years, it can probably find issues in yours.
But the security angle is just the most dramatic manifestation of something larger. What Mythos represents—and what similar models from Google and OpenAI will represent when they arrive in the next few months—is a forcing function for simplification. When models get meaningfully smarter, they expose how much of our current infrastructure exists to compensate for their previous limitations.
The Bitter Lesson About Building With LLMs
Nate B. Jones, an AI strategy consultant covering the leak, frames this as "the bitter lesson of building with LLMs." Humans like to think we add value through complexity—elaborate prompt scaffolding, intricate retrieval architectures, hard-coded domain knowledge, multi-stage verification pipelines. "We as humans think we have a lot of value to add to these models," Jones says in his analysis. "We can add our judgment, we can complexify, we can add a lot of scaffolding and systems around these models and it will make them better. And as they get more powerful, the bitter lesson is that simpler works best."
The term "bitter lesson" comes from AI researcher Richard Sutton's observation that general methods that leverage computation ultimately outperform approaches that rely on human knowledge. In the context of LLM tooling, it means that your 3,000-token system prompt full of procedural instructions—first classify the intent, then check for hallucinated URLs, then do X, then Y—was written because the model couldn't reliably skip those steps. When a model gets two or three times smarter, you might be able to delete 30-50% of that prompt.
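To make the 30-50% figure concrete, here is a hypothetical before-and-after: a procedural system prompt of the kind the article describes next to a goal-focused replacement, with a rough token estimate. Both prompts and the four-characters-per-token heuristic are illustrative assumptions, not anyone's published guidance.

```python
# Hypothetical illustration: a procedural prompt written for a weaker model
# versus the goal-focused prompt a smarter model may only need.

PROCEDURAL_PROMPT = """\
You are a research assistant.
Step 1: Classify the user's intent as factual, analytical, or exploratory.
Step 2: List candidate sources before answering.
Step 3: Check every URL you cite for plausibility; never invent links.
Step 4: Draft an outline, then expand each section.
Step 5: Re-read your answer and remove unsupported claims.
"""

GOAL_PROMPT = """\
Research the user's question thoroughly and answer with cited,
verifiable sources. Accuracy matters more than length.
"""

def token_estimate(text: str) -> int:
    """Rough estimate (~4 characters per token); real counts come
    from the provider's tokenizer."""
    return len(text) // 4

# How much context-window budget the procedural steps were consuming.
savings = 1 - token_estimate(GOAL_PROMPT) / token_estimate(PROCEDURAL_PROMPT)
print(f"Prompt shrinks by roughly {savings:.0%}")
```

The point is not the exact percentage but the exercise: each "Step N" line existed to patch a model weakness, and each is a candidate for deletion when the weakness goes away.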
This applies across the stack. Retrieval architecture that currently requires careful human-designed logic about what documents to surface and when? With larger context windows and smarter models, you increasingly just point the model at a well-organized repository and let it decide what it needs. Hard-coded business rules and domain knowledge? The model can infer most of that from context examples. Multi-stage verification gates in your software pipeline? Consolidate to a single comprehensive eval at the end.
Jones describes his own experience: "I had a prompt that I was using around how I do research. And I'd been using it for a couple of model generations. And one day I forgot to put the full 10-line prompt in and I just put a one-liner and said 'go research' and I got a better result back because the 10-line prompt was more detailed about methodology than it needed to be and was over-constraining my model."
The pattern holds across technical and non-technical use cases. If you're building agents, ask yourself whether each instruction exists because the model needs it or because you needed the model to need it. If you're just using these tools for work, stop explaining your role and context in such detail—the model will infer it reliably enough from what you're asking it to do.
What Actually Needs To Change
Four categories of infrastructure are likely to break or become inefficient when Mythos-class models arrive:
Prompt scaffolding. The recommendation from both Anthropic and OpenAI is converging: tell the model what you need and why, not how to get there. Procedural instructions made sense when models needed hand-holding. They become friction when models can navigate the task space themselves.
Retrieval architecture. This doesn't mean RAG is dead—it means the division of labor shifts. You still decide what resources the model can access initially, but you dictate less about how it navigates those resources once it's looking. "Our goal increasingly is to say here is the goal, go get it done and then to measure success," Jones argues. "And that's it."
Hard-coded domain knowledge. If your prompts include extensive rules about house style, business processes, or technical standards, ask whether the model could infer those from a few examples instead. As Jones puts it: "The art of prompting for the first couple years of LLM was about what you put in, increasingly the art of prompting is about what you leave out."
Verification gates. For software development, the trend is toward fewer intermediate checks and one comprehensive eval at the end. Human review is already becoming a bottleneck—there are conversations happening in San Francisco about how humans can't review all the code these systems produce. Mythos will make that worse if you're depending on human handoffs.
The economic logic reinforces this. Mythos-class models are expensive to run—Anthropic has essentially confirmed they'll initially only be available on premium plans. You want to use tokens efficiently, which means not cluttering context windows with instructions the model doesn't need.
The Uncomfortable Part
The bitter lesson is called bitter for a reason. Much of what felt like skilled AI engineering—the careful prompt design, the elegant retrieval logic, the thoughtful scaffolding—turns out to be temporary compensation for model limitations. As those limitations fall away, the value of that work decreases.
This doesn't mean prompting skill becomes worthless. It means the skill evolves: from elaborate instruction-writing toward outcome specification and knowing what to leave out. From architectural complexity toward simplicity that trusts the model more.
Jones frames this as a career question: "If you're thinking about your future, about your career, you need to ask yourself basically, am I in a position where I can..." The video cuts off there, but the implication is clear. The ability to see a new model coming and preemptively simplify your systems—to model in advance what new intelligence enables—becomes a valuable and hard-to-measure skill.
Anthropic is taking the unusual step of letting security researchers test Mythos early, specifically so critical infrastructure can harden defenses before the model is widely available. That's responsible, and it suggests they understand what they've built. But it also means the window for preparation is narrow. Mythos could launch within a month or two. Google and OpenAI will have similar models shortly after.
The question isn't whether these models arrive. It's whether your stack is ready to get simpler.
—Dev Kapoor
Watch the Original Video
Claude Mythos Changes Everything. Your AI Stack Isn't Ready.
AI News & Strategy Daily | Nate B Jones
31m 21s
About This Source
AI News & Strategy Daily | Nate B Jones
AI News & Strategy Daily, managed by Nate B. Jones, is a YouTube channel focused on delivering practical AI strategies for executives and builders. Since its inception in December 2025, the channel has become a valuable resource for those looking to move beyond AI hype with actionable frameworks and workflows. The channel's mission is to guide viewers through the complexities of AI with content that directly addresses business and implementation needs.
More Like This
AI's Two Paths: Safety First or Fast Deployment?
Exploring Altman and Amodei's divergent AI safety strategies.
Anthropic's Three Tools That Work While You Sleep
Anthropic's scheduled tasks, Dispatch, and Computer Use create the first practical always-on AI agent infrastructure. Here's what actually matters.
Anthropic's Leaked Conway Agent Reveals New Lock-In Layer
The Conway leak shows Anthropic building an always-on AI agent that locks users in through learned behavior, not data—a platform strategy with no exit.
Anthropic's Claude Mythos Is So Good They Won't Release It
Claude Mythos finds decades-old vulnerabilities in major software. Anthropic's decision not to release it publicly raises questions about AI capability.