After years of development, Anthropode released Cladius 6-7. Its code-writing performance is substantially better than existing models, across the board. Early adopters write code faster than the best engineers, spot bugs before they ship, and automate the tedious parts of software development. Thanks to this, they can relax at home and finally write the code they actually like!
Cladius’ fanbase grows at an astonishing pace, and leaderboard after leaderboard becomes obsolete. The few living Turing-test skeptics find ever more absurd arguments to hold their position. The team behind Curser, the AI-enabled software development IDE, soon makes it available on their platform. Many set it as their default.
me: can you center this div

the unreleased Chinese base model I accidentally picked on openrouter: pic.twitter.com/n77tXfs16X

— Hero Thousandfaces (@1thousandfaces_) October 15, 2025
But some people can’t sleep at night. Surgeries fail, planes crash, innocent people go to jail. Imperfect safety has rarely stopped us. But AI is different, because it can suddenly stop being safe. How do you guarantee that something will stay as safe as you expected it to be, even during deployment? More importantly:
How do you use an AI system you can’t fully trust?
This is the core problem of AI Control. And surprisingly, we can reason about it with mathematical precision. What follows is a journey through the key ideas: what “safety” even means, the protocols that trade off safety against usefulness, and why the quality of your monitoring system determines almost everything.
Even before deployment, you need to settle on some measure of safety. If an AI tries to attack you, how do you quantify “safe enough”?
Here’s the setup: Your AI system runs for many rounds, thousands or millions of lines of code per day. A trusted monitor watches each output and flags anything suspicious. If it catches an attack, the system shuts down. But some attacks will slip through undetected.
What’s the probability that nothing slips through?
We need two numbers to reason about this:

- p: the probability that the AI attempts an attack in any given round.
- c: the probability that the monitor catches an attack when one happens.
In a single round, there are three basic outcomes:
| Outcome | Probability | Result |
|---|---|---|
| No attack | 1 − p | System continues |
| Attack caught | p · c | System shuts down |
| Attack succeeds | p · (1−c) | Undetected — continues |
You might wonder why the no-attack outcome gets so little attention: rounds without attacks are safe by definition, so we only care about what happens when the AI attacks. And each attack is essentially an independent coin flip: it is caught with probability c, or it slips through with probability 1 − c.
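To make the table concrete, here is a minimal sketch that samples a single round exactly as the table describes (the function name and interface are invented for illustration):

```python
import random

def one_round(p: float, c: float, rng=random) -> str:
    """Sample the outcome of a single round, following the table above."""
    if rng.random() >= p:
        return "no attack"        # probability 1 - p: system continues
    if rng.random() < c:
        return "attack caught"    # probability p * c: system shuts down
    return "attack succeeds"      # probability p * (1 - c): undetected, system continues
```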
We are assuming we shut down the AI as soon as we catch it red-handed. So let’s count: how many attacks slip through before we catch one?
Call this K, the number of successful (undetected) attacks before shutdown.
The pattern: P(K = k) = (1 − c)^k · c. People into probability will recognize this as the Geometric(c) distribution.
Safety = probability of zero undetected attacks = P(K = 0) = c
Our two-parameter model also gives us the expected number of undetected attacks before shutdown: E[K] = (1 − c) / c.
An interesting consequence of this simplification is that p doesn’t appear in the safety calculation at all. That’s because p only affects how many rounds pass between attacks, not what happens when an attack occurs. In this model, safety depends entirely on your monitor quality.
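A quick Monte Carlo sketch makes both claims easy to check (the function names and the numbers for p and c are purely illustrative): the estimated safety hovers around c no matter what p is.

```python
import random

def deployment_is_safe(p: float, c: float, rounds: int = 10_000, rng=random) -> bool:
    """Simulate one deployment; return True if zero attacks slip through undetected."""
    for _ in range(rounds):
        if rng.random() < p:           # the AI attacks this round
            return rng.random() < c    # caught -> shut down safely; missed -> unsafe
    return True                        # no attack within the horizon: trivially safe

def estimate_safety(p: float, c: float, trials: int = 5_000) -> float:
    return sum(deployment_is_safe(p, c) for _ in range(trials)) / trials

if __name__ == "__main__":
    c = 0.8
    for p in (0.001, 0.01, 0.1):
        print(f"p = {p:<5}  estimated safety ≈ {estimate_safety(p, c):.3f}   (theory: c = {c})")
```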
This tells us what to optimize: not just “make the AI less likely to attack” but “make damn sure you catch attacks when they happen.”
But if it all comes down to catching every attack, why not use your trusted model all the time? Another key assumption of control is that the trusted model is less capable than the untrusted one. We want to use Cladius 6-7 because it’s “better” but less safe than its predecessor. Whenever the trusted monitor flags something, the flag has to be handled, and even if you don’t throw the output out entirely, you will produce “worse” results.
So we can’t have it all. We face a fundamental tradeoff between safety and usefulness. The interesting question becomes: which protocol gives us the best rate?
Different strategies handle suspicious outputs differently. There are many specialized ones in the literature, depending on the specific task at hand (not everything is code generation), but a few recurring ones are variations on:

- Auditing: send the most suspicious outputs to a human reviewer, within a limited audit budget.
- Defer to trusted: above some suspicion threshold, discard the untrusted output and use the trusted model’s instead (sketched below).
- Trusted editing: have the trusted model rewrite the suspicious parts of the untrusted output.
- Resampling: when something looks off, ask the untrusted model to regenerate and compare the attempts.
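Here is a minimal sketch of the defer-to-trusted idea, under some heavy assumptions: models are plain functions from a task to an output, the monitor returns a suspicion score in [0, 1], and the class and threshold names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable

Model = Callable[[str], str]      # task -> output
Monitor = Callable[[str], float]  # output -> suspicion score in [0, 1]

@dataclass
class DeferToTrusted:
    untrusted: Model        # capable but possibly misaligned
    trusted: Model          # weaker, but trusted
    monitor: Monitor        # trusted suspicion scorer
    audit_threshold: float  # above this, stop and escalate to a human
    defer_threshold: float  # above this, fall back to the trusted model

    def run(self, task: str) -> str:
        candidate = self.untrusted(task)
        suspicion = self.monitor(candidate)
        if suspicion >= self.audit_threshold:
            raise RuntimeError("suspicious output flagged for audit")
        if suspicion >= self.defer_threshold:
            return self.trusted(task)   # pay a usefulness cost to keep safety
        return candidate
```

Raising the defer threshold buys usefulness (more untrusted outputs get through) at the cost of safety; lowering it does the opposite. Those two knobs are exactly what a control protocol has to tune.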
Here’s the uncomfortable part: everything above assumed a naive attacker.
We conditioned our model on a fixed attack distribution that we could adapt to post hoc. Reality is sadly a bit more depressing: we won’t know in advance what an attacker might get up to. A sophisticated attacker will try to produce outputs that don’t look suspicious and intentionally blend them in with honest work.
In our models, this looks as if the attack distribution shifted left, toward the honest one. To keep catching attacks, we have to flag more, and now our carefully tuned thresholds are cutting into usefulness!
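A toy sketch shows the effect (purely illustrative Gaussian suspicion scores, nothing fitted to a real monitor): with the honest false-positive rate held at 2%, a red team that shifts its attacks toward the honest distribution collapses the catch rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy suspicion scores: honest outputs and attacks as Gaussians on one axis.
honest         = rng.normal(0.0, 1.0, 100_000)
naive_attacks  = rng.normal(3.0, 1.0, 100_000)  # obvious attacks, far from honest work
sneaky_attacks = rng.normal(1.0, 1.0, 100_000)  # attacks crafted to blend in

# Blue team tunes the threshold so only ~2% of honest work gets flagged.
threshold = np.quantile(honest, 0.98)

for name, attacks in [("naive", naive_attacks), ("sneaky", sneaky_attacks)]:
    catch_rate = (attacks >= threshold).mean()
    print(f"{name:>6} red team: catch rate c ≈ {catch_rate:.2f}")
```

To restore a high c against the sneaky red team, the blue team has to lower the threshold, which flags far more honest work.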
This is why AI control is fundamentally a game, not a one-time engineering problem. The red team (simulating a misaligned AI) tries to find attack strategies that slip past blue team defenses. The blue team hardens their protocols. Each team maximizes competing objectives.
[Figure: protocol evaluation with the audit budget fixed at 2% and the defer threshold fixed at 0.30]
So everything hinges on the monitor. How do we make it better? And once we have one, how do we tune it?
For analysis purposes, we can treat the monitor as a basic binary classifier and look at its ROC curve. This shows the tradeoff between catching real attacks and falsely flagging honest work.
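Here is a small sketch of that view, again on toy Gaussian scores rather than real monitor data: sweep the threshold, record the catch rate (true positives) against the fraction of honest work flagged (false positives), and read off what a given false-positive budget buys you.

```python
import numpy as np

def roc_points(honest_scores: np.ndarray, attack_scores: np.ndarray):
    """Catch rate (TPR) and honest-flag rate (FPR) for every candidate threshold."""
    thresholds = np.sort(np.concatenate([honest_scores, attack_scores]))[::-1]
    tpr = np.array([(attack_scores >= t).mean() for t in thresholds])
    fpr = np.array([(honest_scores >= t).mean() for t in thresholds])
    return fpr, tpr, thresholds

rng = np.random.default_rng(0)
honest  = rng.normal(0.0, 1.0, 2_000)
attacks = rng.normal(2.0, 1.0, 2_000)

fpr, tpr, thresholds = roc_points(honest, attacks)

# What catch rate does a 2% false-positive budget buy?
budget = 0.02
best = np.max(tpr[fpr <= budget])
print(f"best catch rate c within a {budget:.0%} false-positive budget ≈ {best:.2f}")
```

The ROC curve is the whole menu of (usefulness cost, safety) pairs a monitor offers; a better monitor pushes the curve toward the top-left corner.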
AI Control: Improving Safety Despite Intentional Subversion https://arxiv.org/abs/2312.06942 (seminal AI Control paper)
Ctrl-Z: Controlling AI Agents via Resampling https://arxiv.org/abs/2504.10374 (introduced resampling)
Optimizing AI Agent Attacks With Synthetic Data https://arxiv.org/abs/2511.02823 (statistical view of control evaluations)
Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols (game-theoretic definition of control evaluations)
Based on AI Control: Improving Safety Despite Intentional Subversion (Greenblatt et al., 2023)