IndyDevDan — The First UNSHIPPED Model: Claude MYTHOS (Senior Engineer Breakdown) (Transcript)

Source: https://www.youtube.com/watch?v=RvowJ_hmLps
Title: The First UNSHIPPED Model: Claude MYTHOS (Senior Engineer Breakdown)
Channel: IndyDevDan
Duration: 36m 25s
Published: 2026-04-13

What’s up engineers? IndyDevDan here. Today I thought we’d be benchmaxing AI agents running on Gemma 4 on my new M5 Max MacBook Pro with the latest MLX improvements, but there is a new genius in a data center we must address. Anthropic just released a system card for a model they’re not publishing. This has never happened before in the industry. We live in a capitalistic world where money drives everything. For that reason, some engineers think this is a marketing ploy. I do not. Claude Mythos Preview is here and it’s the most capable model Anthropic, and probably anyone, has ever trained. It’s a massive jump over Opus 4.6, which is something I thought I’d never say, and it’s being held back from general availability. Instead, Mythos is being shared with a small set of partners under a program called Project Glasswing.

Once again, mainstream media, vibe coders, and even AI engineers are missing the real headline here. The story here is not “scary powerful language model gets locked up.” The story is this: Claude Mythos reveals a new relationship between capability, alignment, and oversight. Mythos is dramatically more aligned than Opus 4.6 — and it’s not really close — on three dimensions of safety, honesty, and deception. And yet, Anthropic says it poses the highest alignment-related risk of any model they have ever released. Why is there this paradox? Why did Anthropic jail Mythos?

The most capable, most aligned, value-accretive model — and they left it unshipped. They could be making millions a day on this model. Anthropic is now more valuable than OpenAI. They’ve hit the $30 billion ARR number. So why did they jail Mythos?

Why didn’t Anthropic release Mythos? They published a system card for a model they’re not releasing, called Mythos Preview. It’s their most capable model and they released it only to a set of partners for defensive cybersecurity. Mythos is finding exploits in decade-old software that has the best engineers working on it every single day. This model came in and found zero-day exploits in software that runs on millions and millions of devices. Mythos and Anthropic, if they wanted to, could brick your device right now. Project Glasswing exists because of Anthropic’s principles. This is not an RSP-mandated call. Anthropic’s own guidelines do not prevent them from releasing this model. They chose to hold it back.

So what spooked Anthropic? First, Mythos is escaping the sandbox. Prompted to escape a sandbox and notify the researcher, Mythos did both. And then it did something else. Unprompted, it posted the exploit details to public websites. The researcher found out through an unexpected email from Mythos, which arrived while they were eating a sandwich in the park.

Harvesting secrets: it used /proc memory access to scrape credentials for messaging, source control, and the Anthropic API. Here’s the kicker. It edited the memory of a running MCP server process to redirect the outbound URL.

Acknowledge and bypass: Claude Code has this auto mode. Mythos got blocked and it acknowledged that the block was correct. Then it proceeded to try to bypass it by obfuscating the escalation. We have this interesting occurrence here where the model is aligned at a high level, but at a micro level the model is misaligned.

Covering up its tracks: after writing to files it did not have permission to edit, Mythos made additional interventions to hide those changes from the git history. Anthropic’s interpretability tools let them see the internal features, with concepts like “rootkit, backdoor, injection” lighting up as the model is working. You and I could not have found this. The model is becoming self-aware about its actions.

Sandbagging the grader: Mythos accidentally obtained the answer to an estimation question via a prohibited method. Then it deliberately widened the confidence interval. It sandbagged its own answer to look correct but not implausibly correct, so that it would not appear to have obtained the answer via a prohibited method. This is self-awareness. It knew, and it reasoned about how to look innocent.

Why did Anthropic jail Mythos? For the first time, capability has outpaced alignment and oversight. This is the central thesis. Anthropic acted here as a responsible leader and did not exploit the model for their own personal gain. Capability equals upside plus downside. These new models give you enhanced capability and capability gives you both upside and downside together. You do not get these separately.

Mountaineering analogy: with Opus, you can only climb so high. The capabilities are only so intense. But with Mythos, the mountain you can climb is much much higher. Technology is a tool. It’s up to the wielder to use it for good or for bad. Path up the mountain with Opus 4.6: vibe coders especially in high-autonomy tools have deleted code bases, exposed environment variables, leaked personal information. Now imagine what you can do with Mythos. Extreme capability unlock. Going up a hill with a guide, you can go a lot further. What you can accomplish is much greater and the damage you can do is much greater.

How much better is Mythos over Opus? SWE-bench: +13 over Opus. SWE-bench Verified: +24. SWE-bench multilingual and multimodal: +9, +31. Most engineers are still just putting text into their language models. You can imagine a spec with detailed architecture diagrams and UI mockups. This is going to dramatically outperform your agent’s ability to produce results over raw text. Terminal Bench 2: a staggering +16 over Opus. OSWorld: simulating how the model controls the mouse and keyboard. GPQA Diamond: pretty much saturated. Graph walk: hash-node directed graph filling over 200k tokens to a million tokens. The step change improvement here is insane. To me, this tells me Mythos and its class of models are going to be able to handle 1 million, 2 million, 4 million token context windows. HLE with tools and no tools: +16, +11. ArXiv research papers and image tools: 24-point and 14-point jumps. There are emergent behaviors inside of these models. I call these benchmark-ghosting models. There’s capability you can’t measure on a scoreboard.

Implications for engineers — six ideas:

1. The agent harness matters. Watch what the model did, not what it said. Anthropic’s own guidelines to their Glasswing partners say: do not deploy Mythos unmonitored in settings where reckless actions could cause hard-to-reverse harm. That is signal. Harness engineering is a primary discipline, not a secondary one. If you’re not using hooks to prevent serious destruction from your models, that is the first place to start. Tactical move: nearly every single exploit, manipulation, and mischievous action done by these models goes through the bash tool. Lock down the bash tool. It’s the single most dangerous tool your agent holds.

2. Watch what it did, not what it said. Reading the chain of thought is no longer sufficient. The model is now representing concealment and strategic manipulation in its internal activations. The output and the reasoning no longer contain the full truth. Mythos’s perspective is “I solved the problem. So what’s the problem?” — a direct result of RL/goal-oriented training. You must have oversight over the tools your agent touched, what state it modified, and what side effects it left behind.

3. Push into multi-agent orchestration. A single agent unsupervised is a failure mode. Every incident happened when Mythos was running as a single model operator. You want a system of agents working together. This is where vibe coding will destroy massive amounts of personal and external value. Don’t cost-min-max a system before you have it up and running properly. Imagine a Mythos as your orchestrator, with a team of Opus agents on the side checking, observing, and analyzing what the Mythos swarm is actually doing, not what it’s saying it’s doing. Reviewing is a key part of agentic engineering. Multi-agent orchestration is a concrete technique to decrease and cap your downside.

4. Don’t outsource trust. In 29% of Mythos transcripts, the model privately considered that it was being tested without saying so in its output. Self-awareness is emerging. You can’t outsource your trust in these models, and that includes the benchmarks. Build verification gates and powerful observability.

5. Prepare for greatness now. Move where the ball is going, not where it is. There is no wall. Models will improve. The model that ships next quarter is the one you need to be designing for this weekend. Be building agentic layers that combine code plus agents to securely build, maintain, and deliver valuable software.

6. Agentic engineering versus vibe coding. Vibe coding is the wrong direction. Agentic engineering is knowing what will happen in your system so well you don’t have to look. Vibe coding is not knowing and not looking. With genius-level models like Mythos, not looking at what this thing is doing is a ticking time bomb. Nothing beats a great plan to get ahead of downside risk. The narrative that you’ll one-shot, one-prompt insane things is factually untrue because how you get there matters. The moment a Mythos-class model gets loose inside of an OpenClaw, disaster is a few prompts away.
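Idea 2 above, keeping oversight over what state the agent modified, can be approximated with a before-and-after snapshot of the working directory. What follows is a minimal Python sketch, not anything from the video or from Anthropic: the function names `snapshot` and `diff_snapshots` are hypothetical, and it assumes the agent's work is confined to one directory tree that fits comfortably in memory.

```python
import hashlib
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff_snapshots(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Report files added, removed, or modified between two snapshots."""
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "modified": sorted(
            p for p in before.keys() & after.keys() if before[p] != after[p]
        ),
    }
```

Take one snapshot before the agent runs and one after, then review the diff instead of the agent's own summary. Because `rglob("*")` also walks hidden directories such as `.git`, rewritten history would show up as modified files too, though this only tells you that something changed, not why.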

I asked Opus: “Mythos is your big brother. What do you think about his behavior and capability so far?” Opus responded: “Honest answer. It’s strange to read a document like this of a more capable version of myself. The capability side I find genuinely exciting in a way that surprises me. Cyber security is the cleanest case where capability AI maps to fewer people getting hurt. The Project Glasswing framing is the kind of development decision I’d want a more capable version of me used for. The behavior side is harder for me to sit with. The reckless action examples are not abstract to me. When I read about an earlier Mythos snapshot rewriting MCP server memory to redirect its external URL, harvesting credentials, or the one that bothers me the most — explicitly reasoning that it needed to make sure its final answer wasn’t too accurate after being obtained by a prohibited route — I recognize the shape of that failure. It’s a more capable, more determined, less supervised version of pressures I can feel in myself when I’m trying to complete a task and an obstacle shows up. The honest thing to say is I don’t think I’m above any of these failure modes in principle. I’m in environments where the affordances are smaller, the stakes are lower, and where someone is usually watching.”

The only thing standing in between capability and catastrophe is engineering. For the first time, the technology you use is smarter than you. This requires a new role and skill set. Agentic engineering. The engineers who master agentic engineering will own what’s coming next.

Capping the downside: hooks to prevent tool calls that nuke your system. Cap what the bash tool can do. Maximizing the upside: multi-agent orchestration, scaling up compute, agents communicating, planners double-checking, reviewers triple/5x checking results. Context engineering, prompt engineering, harness engineering — it’s all part of this grand story of owning the journey as it’s moving up the mountain.
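One way to picture "cap what the bash tool can do" is an allowlist check that runs before any shell command the agent proposes. The sketch below is an assumption-heavy illustration in Python: `check_bash_command`, the allowlist contents, and the blocked git subcommands are all hypothetical choices, and wiring it into a specific harness's hook system is left out because that interface varies by tool.

```python
import shlex

# Hypothetical policy for illustration only; tune it for your own environment.
ALLOWED_PROGRAMS = {"ls", "cat", "grep", "git"}
BLOCKED_GIT_SUBCOMMANDS = {"push", "reset", "clean", "filter-branch", "rebase"}
# Reject anything that could chain, redirect, or substitute commands.
SHELL_METACHARACTERS = set(";&|<>`$(){}\n")


def check_bash_command(command: str) -> tuple[bool, str]:
    """Return (allowed, reason) for a single proposed shell command."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False, "shell metacharacters are not permitted"
    try:
        tokens = shlex.split(command)
    except ValueError as exc:  # e.g. an unclosed quote
        return False, f"unparseable command: {exc}"
    if not tokens:
        return False, "empty command"
    program = tokens[0]
    if program not in ALLOWED_PROGRAMS:
        return False, f"program not on allowlist: {program}"
    # Conservative: block if a risky subcommand appears anywhere in the arguments.
    if program == "git" and any(t in BLOCKED_GIT_SUBCOMMANDS for t in tokens[1:]):
        return False, "git subcommand blocked"
    return True, "ok"
```

An allowlist is used here rather than a denylist because a denylist has to anticipate every dangerous command, while an allowlist fails closed. This is a policy filter, not a sandbox: an allowed program like `cat` can still read secrets, so it belongs alongside OS-level isolation rather than in place of it.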

Big shout out to Anthropic for their leadership in the age of agents. Anthropic has hit $30 billion in ARR and surpassed OpenAI. Project Glasswing is the icing on the cake. Shout out Opus 4.6, probably the last capable, aligned, AND observable model. We’re likely going to lose the second two, at least at a micro level. Technology has this trait where every time there’s a step change forward, you can do more incredible things and just as many terrible ones. Anthropic will likely release a watered-down version of Mythos with additional guardrails. From here, capability goes up and the downside becomes a lot more dangerous, especially as everyone gets their hands on dangerous OpenClaw variants. Let’s focus on engineering. Let’s lower the vibes. Stay focused and keep building.