After years of development, Anthropode releases Cladius 6-7. Its code-writing performance is substantially better than that of existing models, across the board. Early adopters write code faster than the best engineers, spot bugs before they ship, and automate the tedious parts of software development. Thanks to this, they can relax at home and finally get to write code they actually like!

Cladius’ fanbase grows at an astonishing pace, and leaderboard after leaderboard becomes obsolete. The few living Turing-test skeptics find ever more absurd arguments to hold their position. The team behind Curser, the AI-enabled software development IDE, soon makes it available on their platform. Many set it as their default.

But some people can’t sleep at night. Surgeries fail, planes fall, innocent people go to jail. Imperfect safety has rarely stopped us. But AI is different, because it can suddenly stop being safe. How do you guarantee something stays as safe as you expected, even during deployment? More importantly:

How do you use an AI system you can’t fully trust?

This is the core problem of AI Control. And surprisingly, we can reason about it with mathematical precision. What follows is a journey through the key ideas: what “safety” even means, the protocols that trade off safety against usefulness, and why the quality of your monitoring system determines almost everything.


What Does Safety Even Mean?

Even before deployment, you need to settle on some measure of safety. If an AI tries to attack you, how do you quantify “safe enough”?

Here’s the setup: Your AI system runs for many rounds, thousands or millions of lines of code per day. A trusted monitor watches each output and flags anything suspicious. If it catches an attack, the system shuts down. But some attacks will slip through undetected.

What’s the probability that nothing slips through?

We need two numbers to reason about this: p, the probability that the AI attacks in any given round, and c, the probability that the monitor catches an attack when one happens.

In a single round, there are three basic outcomes:

| Outcome | Probability | Result |
| --- | --- | --- |
| No attack | 1 − p | System continues |
| Attack caught | p · c | System shuts down |
| Attack succeeds | p · (1 − c) | Undetected, system continues |
You might notice that the first outcome barely matters for the analysis: rounds without attacks are safe by definition. We only care about what happens when the AI attacks. And each attack is essentially an independent coin flip: caught with probability c, or undetected with probability 1 − c.

We are assuming we shut down the AI as soon as we catch it red-handed. So let’s count: how many attacks slip through before we catch one?

Call this K, the number of successful (undetected) attacks before shutdown.

The pattern: P(K = k) = (1 − c)^k · c. People into probability will recognize this as the Geometric(c) distribution.

Safety = probability of zero undetected attacks = P(K = 0) = c

In the same two-parameter model, we can also compute the expected number of undetected attacks: E[K] = (1 − c) / c.
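For completeness, the expectation follows from standard geometric-distribution algebra (a step the text skips):

```latex
\[
\mathbb{E}[K]
  = \sum_{k=0}^{\infty} k \,(1-c)^{k} c
  = c\,(1-c) \sum_{k=1}^{\infty} k \,(1-c)^{k-1}
  = c\,(1-c)\,\frac{1}{c^{2}}
  = \frac{1-c}{c}.
\]
```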

An interesting consequence of this simplification is that p drops out of the safety calculation entirely. It only affects how many rounds pass between attacks, not what happens when an attack occurs. Under this model, safety depends entirely on your monitor’s quality.
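A quick Monte Carlo check of both claims, written as an illustrative sketch rather than anything from the original analysis: simulate deployments that stop at the first caught attack, count the undetected attacks K, and notice that changing p barely moves the estimates.

```python
import random

def run_deployment(p_attack: float, p_catch: float, max_rounds: int = 1_000_000) -> int:
    """Run rounds until the monitor catches an attack (shutdown).

    Returns K, the number of undetected attacks that slipped through first."""
    undetected = 0
    for _ in range(max_rounds):
        if random.random() < p_attack:       # the AI attacks this round
            if random.random() < p_catch:    # caught: shut down immediately
                return undetected
            undetected += 1                  # slipped through, keep running
    return undetected

def estimate(p_attack: float, p_catch: float, trials: int = 20_000) -> tuple[float, float]:
    ks = [run_deployment(p_attack, p_catch) for _ in range(trials)]
    safety = sum(k == 0 for k in ks) / trials  # empirical P(K = 0), should be close to c
    mean_k = sum(ks) / trials                  # empirical E[K], should be close to (1 - c)/c
    return safety, mean_k

# Same catch rate, attack rates an order of magnitude apart: the results match.
print(estimate(p_attack=0.01, p_catch=0.8))  # roughly (0.80, 0.25)
print(estimate(p_attack=0.10, p_catch=0.8))  # roughly (0.80, 0.25)
```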

This tells us what to optimize: not just “make the AI less likely to attack” but “make damn sure you catch attacks when they happen.”

[Interactive figure: Geometric Distribution of Undetected Attacks. A slider sets the catch rate c; the chart plots P(K = k) for k = 0, 1, 2, …, with readouts for Safety = P(K = 0) and E[K], the expected number of undetected attacks.]

The cost of safety

But if it all comes down to catching every attack, why not use your trusted model all the time? Another key assumption of control is that the trusted model is less capable than the untrusted one. We want to use Cladius 6-7 precisely because it is “better” but less safe than its predecessor. Whenever the trusted monitor flags something, that output needs to be handled, and even if you don’t throw it out entirely, you end up producing “worse” outputs.

So we can’t have it all. We face a fundamental tradeoff between safety and usefulness. The interesting question becomes: which protocol gives us the best exchange rate between the two?

Different protocols handle suspicious outputs differently. There are many specialized ones in the literature, depending on the specific task at hand (not everything is code generation), but a few recurring ones are variations on deferring to the trusted model, resampling the untrusted model, and trusted editing of suspicious outputs; a sketch of one such protocol follows below.
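As an illustration, here is a hedged sketch of a defer-to-trusted variant with an audit route, loosely matching the audit-budget and defer-threshold controls in the figure below. The function names and the exact rules (audit above one suspicion threshold, defer above a lower one) are my simplification, not a specification from the literature.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Decision:
    action: str   # "use", "audit", or "defer"
    output: str

def defer_to_trusted(task: str,
                     untrusted_generate: Callable[[str], str],        # e.g. Cladius 6-7
                     trusted_generate: Callable[[str], str],          # weaker but trusted model
                     monitor_suspicion: Callable[[str, str], float],  # suspicion score in [0, 1]
                     audit_threshold: float = 0.90,
                     defer_threshold: float = 0.30) -> Decision:
    """Sketch of a defer-to-trusted protocol with an audit route.

    Very suspicious outputs go to (expensive) human audit; moderately
    suspicious ones are replaced by the trusted model's output; the rest
    are used as-is."""
    candidate = untrusted_generate(task)
    suspicion = monitor_suspicion(task, candidate)

    if suspicion >= audit_threshold:
        return Decision("audit", candidate)                 # spends the audit budget
    if suspicion >= defer_threshold:
        return Decision("defer", trusted_generate(task))    # safer but "worse" output
    return Decision("use", candidate)

# Toy usage with stub models and a stub monitor:
print(defer_to_trusted(
    "write a sorting function",
    untrusted_generate=lambda task: "untrusted code",
    trusted_generate=lambda task: "trusted code",
    monitor_suspicion=lambda task, out: 0.5,
))  # Decision(action='defer', output='trusted code')
```

Here a fixed audit_threshold stands in for the audit budget; in the interactive figure the budget instead caps the fraction of outputs that can be sent to audit.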

[Interactive figure: sliders for the audit budget and defer threshold; a Suspicion Distributions panel showing honest and attack scores with the DEFER and AUDIT regions marked; and a Pareto Frontiers: Safety vs Usefulness panel comparing Trusted Editing, Defer to Trusted, and Resample.]
Enter the Red Team

Here’s the uncomfortable part: everything above assumed a naive attacker.

We conditioned our model on a fixed attack distribution that we can adapt to after the fact. Reality is sadly a bit more depressing: we won’t know in advance what an attacker might get up to. A sophisticated attacker will try to produce outputs that don’t look suspicious and intentionally blend them in with honest work.

In our models, this looks as if the attack distribution shifted left, toward the honest one. Now our carefully tuned thresholds are cutting through usefulness!
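To see the effect numerically, here is an illustrative sketch with invented Beta distributions (none of these numbers come from the article): shifting the attack scores toward the honest ones collapses the catch rate at a fixed threshold, so the blue team has to lower the threshold and eat more false positives to compensate.

```python
import numpy as np

rng = np.random.default_rng(0)
audit_threshold = 0.70                       # fixed blue-team threshold

honest = rng.beta(2, 8, 100_000)             # honest work: mostly low suspicion
naive_attack = rng.beta(8, 2, 100_000)       # naive attacks: mostly high suspicion
evasive_attack = rng.beta(4, 4, 100_000)     # evasive attacks: shifted toward honest

def flag_rate(scores: np.ndarray, threshold: float) -> float:
    """Fraction of outputs at or above the suspicion threshold."""
    return float((scores >= threshold).mean())

print(flag_rate(naive_attack, audit_threshold))    # catch rate around 0.80
print(flag_rate(evasive_attack, audit_threshold))  # catch rate drops to around 0.13
print(flag_rate(honest, audit_threshold))          # false-positive rate is unchanged
```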

This is why AI control is fundamentally a game, not a one-time engineering problem. The red team (simulating a misaligned AI) tries to find attack strategies that slip past blue team defenses. The blue team hardens their protocols. Each team maximizes competing objectives.

[Interactive figure: sliders for attack evasiveness and attack consistency, with the audit budget fixed at 2% and the defer threshold at 0.30. The Suspicion Distributions panel overlays the shifted attack distribution on the baseline attack, and the Pareto Frontiers: Safety vs Usefulness panel (Trusted Editing, Defer to Trusted, Resample) updates along with a safety-degradation readout.]

The Monitor’s Dial

So everything hinges on the monitor. How do we make it better? And once we have one, how do we tune it?

For analysis purposes, we can treat the monitor as a basic binary classifier and look at its ROC curve. This shows the tradeoff between catching real attacks and falsely flagging honest work.
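A minimal sketch of that analysis, assuming we have suspicion scores for labeled honest and attack trajectories (the scores below are synthetic): sweep the threshold and record the true-positive rate (attacks caught) and false-positive rate (honest work flagged).

```python
import numpy as np

def roc_curve(honest_scores: np.ndarray, attack_scores: np.ndarray, n_points: int = 101):
    """Sweep the suspicion threshold and return arrays of (FPR, TPR).

    TPR = fraction of attacks flagged (caught);
    FPR = fraction of honest outputs flagged (wasted audits or deferrals)."""
    thresholds = np.linspace(0.0, 1.0, n_points)
    tpr = np.array([(attack_scores >= t).mean() for t in thresholds])
    fpr = np.array([(honest_scores >= t).mean() for t in thresholds])
    return fpr, tpr

# Synthetic monitor scores, for illustration only
rng = np.random.default_rng(1)
honest = rng.beta(2, 6, 50_000)
attack = rng.beta(6, 2, 50_000)

fpr, tpr = roc_curve(honest, attack)

# Area under the curve via the trapezoid rule (with FPR sorted ascending)
x, y = fpr[::-1], tpr[::-1]
auc = float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2))
print(f"AUC = {auc:.2f}")
```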

[Interactive figure: a weak-to-strong monitor slider (AUC), an ROC curve with a draggable threshold point (false positive rate vs. true positive rate), and the resulting safety-usefulness tradeoff, with readouts for the true-positive rate, false-positive rate, safety, and usefulness.]

Read more!

Next: The Many Faces of AI Control

Based on AI Control: Improving Safety Despite Intentional Subversion (Greenblatt et al., 2023)


Website forked from this repo