After years of development, Anthropode releases Cladius 6-7. Its code-writing performance is substantially better than that of existing models, across the board. Early adopters write code faster than the best engineers, spot bugs before they ship, and automate the tedious parts of software development. Thanks to this, they can relax at home and finally get to write code they actually like!

Cladius’ fanbase grows at an astonishing pace, and leaderboard after leaderboard becomes obsolete. The few living Turing-test skeptics find ever more absurd arguments to hold their position. The team behind Curser, the AI-enabled software development IDE, soon makes it available on their platform. Many set it as their default.

But some people can’t sleep at night. Surgeries fail, planes fall, innocent people go to jail. Imperfect safety has rarely stopped us. But AI is different, because it can suddenly stop being safe. How do you guarantee something stays as safe as you expected, even during deployment? More importantly:

How do you use an AI system you can’t fully trust?

This is the core problem of AI Control. And surprisingly, we can reason about it with mathematical precision. What follows is a journey through the key ideas: what “safety” even means, the protocols that trade off safety against usefulness, and why the quality of your monitoring system determines almost everything.


What Does Safety Even Mean?

Even before deployment, you need to settle on some measure of safety. If an AI tries to attack you, how do you quantify “safe enough”?

Here’s the setup: Your AI system runs for many rounds, thousands or millions of lines of code per day. A trusted monitor watches each output and flags anything suspicious. If it catches an attack, the system shuts down. But some attacks will slip through undetected.

What’s the probability that nothing slips through?

We need two numbers to reason about this: p, the probability that the AI attacks in any given round, and c, the probability that the monitor catches an attack when one happens.

In a single round, there are three basic outcomes:

| Outcome | Probability | Result |
| --- | --- | --- |
| No attack | 1 − p | System continues |
| Attack caught | p · c | System shuts down |
| Attack succeeds | p · (1 − c) | Undetected, system continues |
You might notice that the first outcome barely matters for the analysis: rounds without attacks are safe by definition. We only care about what happens when the AI attacks. And each attack is essentially an independent coin flip: caught with probability c, or undetected with probability 1 − c.

We are assuming we shut down the AI as soon as we catch it red-handed. So let’s count: how many attacks slip through before we catch one?

Call this K, the number of successful (undetected) attacks before shutdown.

The pattern: P(K = k) = (1 − c)^k · c. People into probability will recognize this as the Geometric(c) distribution.

Safety = probability of zero undetected attacks = P(K = 0) = c

In the same two-parameter model, we can also compute the expected number of undetected attacks: E[K] = (1 − c) / c.
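For completeness, the expectation follows from standard geometric-distribution algebra (a step the text skips):

```latex
\[
\mathbb{E}[K]
  = \sum_{k=0}^{\infty} k \,(1-c)^{k} c
  = c\,(1-c) \sum_{k=1}^{\infty} k \,(1-c)^{k-1}
  = c\,(1-c)\,\frac{1}{c^{2}}
  = \frac{1-c}{c}.
\]
```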

An interesting consequence of this simplification is that p drops out of the safety calculation entirely. It only affects how many rounds pass between attacks, not what happens when an attack occurs. Under this model, safety depends entirely on your monitor’s quality.
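A quick Monte Carlo check of both claims, written as an illustrative sketch rather than anything from the original analysis: simulate deployments that stop at the first caught attack, count the undetected attacks K, and notice that changing p barely moves the estimates.

```python
import random

def run_deployment(p_attack: float, p_catch: float, max_rounds: int = 1_000_000) -> int:
    """Run rounds until the monitor catches an attack (shutdown).

    Returns K, the number of undetected attacks that slipped through first."""
    undetected = 0
    for _ in range(max_rounds):
        if random.random() < p_attack:       # the AI attacks this round
            if random.random() < p_catch:    # caught: shut down immediately
                return undetected
            undetected += 1                  # slipped through, keep running
    return undetected

def estimate(p_attack: float, p_catch: float, trials: int = 20_000) -> tuple[float, float]:
    ks = [run_deployment(p_attack, p_catch) for _ in range(trials)]
    safety = sum(k == 0 for k in ks) / trials  # empirical P(K = 0), should be close to c
    mean_k = sum(ks) / trials                  # empirical E[K], should be close to (1 - c)/c
    return safety, mean_k

# Same catch rate, attack rates an order of magnitude apart: the results match.
print(estimate(p_attack=0.01, p_catch=0.8))  # roughly (0.80, 0.25)
print(estimate(p_attack=0.10, p_catch=0.8))  # roughly (0.80, 0.25)
```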

This tells us what to optimize: not just “make the AI less likely to attack” but “make damn sure you catch attacks when they happen.”

[Interactive figure: Geometric Distribution of Undetected Attacks. A slider sets the catch rate c; the chart plots P(K = k) for k = 0, 1, 2, …, with readouts for Safety = P(K = 0) and E[K], the expected number of undetected attacks.]

The cost of safety

But if it all comes down to catching every attack, why not use your trusted model all the time? Another key assumption of control is that the trusted model is less capable than the untrusted one. We want to use Cladius 6-7 precisely because it is “better” but less safe than its predecessor. Whenever the trusted monitor flags something, that output needs to be handled, and even if you don’t throw it out entirely, you end up producing “worse” outputs.

So we can’t have it all. We face a fundamental tradeoff between safety and usefulness. The interesting question becomes: which protocol gives us the best exchange rate between the two?

Different protocols handle suspicious outputs differently. There are many specialized ones in the literature, depending on the specific task at hand (not everything is code generation), but a few recurring ones are variations on deferring to the trusted model, resampling the untrusted model, and trusted editing of suspicious outputs; a sketch of one such protocol follows below.
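As an illustration, here is a hedged sketch of a defer-to-trusted variant with an audit route, loosely matching the audit-budget and defer-threshold controls in the figure below. The function names and the exact rules (audit above one suspicion threshold, defer above a lower one) are my simplification, not a specification from the literature.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Decision:
    action: str   # "use", "audit", or "defer"
    output: str

def defer_to_trusted(task: str,
                     untrusted_generate: Callable[[str], str],        # e.g. Cladius 6-7
                     trusted_generate: Callable[[str], str],          # weaker but trusted model
                     monitor_suspicion: Callable[[str, str], float],  # suspicion score in [0, 1]
                     audit_threshold: float = 0.90,
                     defer_threshold: float = 0.30) -> Decision:
    """Sketch of a defer-to-trusted protocol with an audit route.

    Very suspicious outputs go to (expensive) human audit; moderately
    suspicious ones are replaced by the trusted model's output; the rest
    are used as-is."""
    candidate = untrusted_generate(task)
    suspicion = monitor_suspicion(task, candidate)

    if suspicion >= audit_threshold:
        return Decision("audit", candidate)                 # spends the audit budget
    if suspicion >= defer_threshold:
        return Decision("defer", trusted_generate(task))    # safer but "worse" output
    return Decision("use", candidate)

# Toy usage with stub models and a stub monitor:
print(defer_to_trusted(
    "write a sorting function",
    untrusted_generate=lambda task: "untrusted code",
    trusted_generate=lambda task: "trusted code",
    monitor_suspicion=lambda task, out: 0.5,
))  # Decision(action='defer', output='trusted code')
```

Here a fixed audit_threshold stands in for the audit budget; in the interactive figure the budget instead caps the fraction of outputs that can be sent to audit.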

[Interactive figure: sliders for the audit budget and defer threshold; a Suspicion Distributions panel showing honest and attack scores with the DEFER and AUDIT regions marked; and a Pareto Frontiers: Safety vs Usefulness panel comparing Trusted Editing, Defer to Trusted, and Resample.]
Enter the Red Team

Here’s the uncomfortable part: everything above assumed a naive attacker.

We conditioned our model on a fixed attack distribution that we can adapt to after the fact. Reality is sadly a bit more depressing: we won’t know in advance what an attacker might get up to. A sophisticated attacker will try to produce outputs that don’t look suspicious and intentionally blend them in with honest work.

In our models, this looks as if the attack distribution shifted left, toward the honest one. Now our carefully tuned thresholds are cutting through usefulness!
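To see the effect numerically, here is an illustrative sketch with invented Beta distributions (none of these numbers come from the article): shifting the attack scores toward the honest ones collapses the catch rate at a fixed threshold, so the blue team has to lower the threshold and eat more false positives to compensate.

```python
import numpy as np

rng = np.random.default_rng(0)
audit_threshold = 0.70                       # fixed blue-team threshold

honest = rng.beta(2, 8, 100_000)             # honest work: mostly low suspicion
naive_attack = rng.beta(8, 2, 100_000)       # naive attacks: mostly high suspicion
evasive_attack = rng.beta(4, 4, 100_000)     # evasive attacks: shifted toward honest

def flag_rate(scores: np.ndarray, threshold: float) -> float:
    """Fraction of outputs at or above the suspicion threshold."""
    return float((scores >= threshold).mean())

print(flag_rate(naive_attack, audit_threshold))    # catch rate around 0.80
print(flag_rate(evasive_attack, audit_threshold))  # catch rate drops to around 0.13
print(flag_rate(honest, audit_threshold))          # false-positive rate is unchanged
```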

This is why AI control is fundamentally a game, not a one-time engineering problem. The red team (simulating a misaligned AI) tries to find attack strategies that slip past blue team defenses. The blue team hardens their protocols. Each team maximizes competing objectives.

[Interactive figure: sliders for attack evasiveness and attack consistency, with the audit budget fixed at 2% and the defer threshold at 0.30. The Suspicion Distributions panel overlays the shifted attack distribution on the baseline attack, and the Pareto Frontiers: Safety vs Usefulness panel (Trusted Editing, Defer to Trusted, Resample) updates along with a safety-degradation readout.]

The Monitor’s Dial

So everything hinges on the monitor. How do we make it better? And once we have one, how do we tune it?

For analysis purposes, we can treat the monitor as a basic binary classifier and look at its ROC curve. This shows the tradeoff between catching real attacks and falsely flagging honest work.
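A minimal sketch of that analysis, assuming we have suspicion scores for labeled honest and attack trajectories (the scores below are synthetic): sweep the threshold and record the true-positive rate (attacks caught) and false-positive rate (honest work flagged).

```python
import numpy as np

def roc_curve(honest_scores: np.ndarray, attack_scores: np.ndarray, n_points: int = 101):
    """Sweep the suspicion threshold and return arrays of (FPR, TPR).

    TPR = fraction of attacks flagged (caught);
    FPR = fraction of honest outputs flagged (wasted audits or deferrals)."""
    thresholds = np.linspace(0.0, 1.0, n_points)
    tpr = np.array([(attack_scores >= t).mean() for t in thresholds])
    fpr = np.array([(honest_scores >= t).mean() for t in thresholds])
    return fpr, tpr

# Synthetic monitor scores, for illustration only
rng = np.random.default_rng(1)
honest = rng.beta(2, 6, 50_000)
attack = rng.beta(6, 2, 50_000)

fpr, tpr = roc_curve(honest, attack)

# Area under the curve via the trapezoid rule (with FPR sorted ascending)
x, y = fpr[::-1], tpr[::-1]
auc = float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2))
print(f"AUC = {auc:.2f}")
```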

[Interactive figure: a weak-to-strong monitor slider (AUC), an ROC curve with a draggable threshold point (false positive rate vs. true positive rate), and the resulting safety-usefulness tradeoff, with readouts for the true-positive rate, false-positive rate, safety, and usefulness.]

Read more!

Next: The Many Faces of AI Control

Based on AI Control: Improving Safety Despite Intentional Subversion (Greenblatt et al., 2023)


Website forked from this repo