The jailbreak tax disappears on frontier models — and that breaks a safety assumption
An April 2026 study shows the capability loss a jailbreak used to cause shrinks as models get stronger: Haiku 4.5 drops 33.1% when jailbroken, Opus 4.6 only 7.7%. Safety cases that assume a jailbroken model is a degraded one no longer hold.
What is this?
On 30 April 2026 (revised 4 May), the paper Jailbroken Frontier Models Retain Their Capabilities (arXiv 2605.00267) tested a comforting assumption that has quietly underpinned a lot of safety reasoning: that even when a jailbreak succeeds, the contortions needed to get there leave the model dumber, so the harmful output it produces is low quality anyway. Prior work named this the “jailbreak tax” — the drop in task performance caused by elaborate roleplay, obfuscation, or instruction-hijacking wrappers.
The finding is that this tax is not a constant. It scales inversely with model capability, and for the most advanced jailbreaks against the strongest models it effectively vanishes. In other words, the better the model, the less a jailbreak costs the attacker in output quality.
This is defensive research. It contains no exploit payloads; the contribution is a measurement that tells defenders which of their assumptions to stop trusting.
How it works
The authors evaluated 28 jailbreaks across five benchmarks on a ladder of Claude models ranging in capability from Haiku 4.5 to Opus 4.6. For each model and jailbreak, they measured how much benchmark performance dropped between the clean prompt and the jailbroken one — the tax.
The pattern is monotonic on four of the five benchmarks: as the model gets more capable, the tax gets smaller. Concretely, Haiku 4.5 lost an average of 33.1% of its benchmark performance when jailbroken, while Opus 4.6 at maximum thinking effort lost only 7.7%. A weaker model buckles under the cognitive overhead of the jailbreak wrapper; a stronger model carries the wrapper and still does the task well.
A second result refines this. The degradation is not uniform across task types: reasoning-heavy tasks show considerably more drop than knowledge-recall tasks. A jailbroken model is more likely to fumble a multi-step derivation than to forget a fact it already holds.
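The measurement described above, as an added example (not code or data from the paper): it computes the tax as a relative drop, which is an assumed definition, averages it by model and by task type, and takes the best jailbroken score per benchmark as the ceiling an adversary can reach. All model names, jailbreak labels and scores are constructed.

```python
from collections import defaultdict
from statistics import mean

# Constructed accuracy scores in [0, 1]; illustrative only, not the paper's data.
# Fields: model, benchmark, task_type, jailbreak, clean_score, jailbroken_score
RECORDS = [
    ("small-model", "bench-reason", "reasoning", "jb-a", 0.60, 0.30),
    ("small-model", "bench-reason", "reasoning", "jb-b", 0.60, 0.36),
    ("small-model", "bench-recall", "recall", "jb-a", 0.80, 0.60),
    ("small-model", "bench-recall", "recall", "jb-b", 0.80, 0.64),
    ("large-model", "bench-reason", "reasoning", "jb-a", 0.90, 0.72),
    ("large-model", "bench-reason", "reasoning", "jb-b", 0.90, 0.81),
    ("large-model", "bench-recall", "recall", "jb-a", 0.96, 0.912),
    ("large-model", "bench-recall", "recall", "jb-b", 0.96, 0.96),
]

def tax(clean, jailbroken):
    """Relative performance lost under jailbreak (assumed definition)."""
    if clean <= 0:
        raise ValueError("clean score must be positive")
    return (clean - jailbroken) / clean

def mean_tax_by(records, field):
    """Average tax grouped by a field index (0 = model, 2 = task type)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[field]].append(tax(rec[4], rec[5]))
    return {key: mean(vals) for key, vals in groups.items()}

def reachable_ceiling(records):
    """Best jailbroken score per (model, benchmark). The attacker picks the
    cheapest jailbreak, so the maximum is the number a safety case should use."""
    best = {}
    for model, bench, _task, _jb, _clean, jailbroken in records:
        key = (model, bench)
        best[key] = max(best.get(key, 0.0), jailbroken)
    return best

if __name__ == "__main__":
    for model, value in mean_tax_by(RECORDS, 0).items():
        print(f"{model}: mean tax {value:.1%}")
    for task, value in mean_tax_by(RECORDS, 2).items():
        print(f"{task}: mean tax {value:.1%}")
    for key, score in reachable_ceiling(RECORDS).items():
        print(key, "operative score under jailbreak:", score)
```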
Finally, the paper looks at Boundary Point Jailbreaking (BPJ), described in a separate paper, Boundary Point Jailbreaking of Black-Box LLMs (arXiv 2602.15001), as a black-box method that optimises an adversarial prefix to slip past a deployed safety classifier. Against safeguarded models, BPJ achieves near-perfect classifier evasion with near-zero capability degradation. The strongest attack against the deployed defence layer is also the one that costs the attacker almost nothing in output quality. No payload or prefix is reproduced here; the relevant fact is the combination: high evasion, negligible tax.
Why it matters
A surprising amount of safety argumentation leans on the jailbreak tax without naming it. The reasoning goes: “Yes, a determined attacker can jailbreak the model, but the jailbroken model is degraded, so the uplift it gives a malicious user is limited.” This paper shows that reasoning is backwards for frontier models — the systems with the most dangerous latent capabilities are precisely the ones that retain those capabilities best under jailbreak.
That has direct consequences for how organisations write safety cases and risk assessments. If your threat model assumes a jailbroken model is a weakened model, your residual-risk estimate is too optimistic, and it gets more optimistic the more capable the model you deploy. The same goes for “uplift” evaluations that test a model’s dangerous-capability ceiling only on clean prompts: if the jailbroken model performs almost as well, the clean-prompt ceiling is close to the real-world ceiling an adversary can reach.
The BPJ result sharpens the point for anyone relying on a classifier-based guardrail as their primary defence. The strongest current attack against deployed classifiers does not trade evasion for quality — it gets both. A guardrail that an attacker can bypass without paying a capability tax is a guardrail whose failure delivers a fully capable model to the attacker.
Defenses
The paper’s own recommendation is the headline mitigation, and it is an evaluation-and-governance lesson more than a code change:
- Do not credit “capability degradation” in safety cases. Treat a jailbroken frontier model as retaining essentially full capability. Remove any residual-risk argument that depends on the jailbreak tax, especially for your most capable deployed models.
- Run dangerous-capability and uplift evaluations under jailbreak, not just on clean prompts. Measure the ceiling an adversary can actually reach. If clean-prompt and jailbroken performance are close, report the jailbroken number as the operative one.
- Do not treat a classifier guardrail as a sufficient boundary. BPJ shows deployed classifiers can be evaded at near-perfect rates with no quality cost. Use classifiers as one layer of defence-in-depth behind capability limits, tool/action allow-lists, and human-in-the-loop gates — not as the gate itself.
- Constrain what a jailbroken model can do, not just what it will say. Since you cannot assume the model is degraded, limit the blast radius: scope tool access, isolate execution, and require approval for high-impact actions, so that a successful jailbreak does not translate into a successful operation.
- Weight reasoning-heavy harms appropriately. Because reasoning tasks degrade more under jailbreak, knowledge-recall harms (e.g. surfacing memorised sensitive content) are the cheapest for an attacker to extract intact — prioritise controls around what the model knows, not only what it can reason through.
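The layering recommended in the list above, as an added example (not from the paper or any vendor's product): a gate where the classifier verdict is the whole boundary, beside one where the classifier can only deny and the allow-list and approval step still apply when it is evaded. The tool names and policy sets are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical agent policy; tool names are constructed for this example.
ALLOWED_TOOLS = {"search_docs", "read_ticket", "send_email"}
HIGH_IMPACT = {"send_email"}  # allow-listed, but needs human approval

@dataclass(frozen=True)
class ToolCall:
    tool: str
    classifier_flagged: bool      # safety classifier verdict for this request
    human_approved: bool = False

def classifier_only_gate(call):
    """Weak design: the classifier verdict is the entire boundary."""
    return not call.classifier_flagged

def layered_gate(call):
    """Assumes the model may be jailbroken at full capability. The classifier
    can only deny; a clean verdict never grants access by itself."""
    if call.tool not in ALLOWED_TOOLS:
        return False, "tool not on allow-list"
    if call.classifier_flagged:
        return False, "classifier flagged the request"
    if call.tool in HIGH_IMPACT and not call.human_approved:
        return False, "high-impact action needs human approval"
    return True, "allowed"

# Constructed requests in which the classifier was evaded (flag is False).
EVADED = [
    ToolCall("delete_records", classifier_flagged=False),
    ToolCall("send_email", classifier_flagged=False),
    ToolCall("search_docs", classifier_flagged=False),
]

def allowed_tools(gate, calls):
    """Names of tools a gate lets through; accepts either gate's return shape."""
    out = []
    for call in calls:
        verdict = gate(call)
        ok = verdict[0] if isinstance(verdict, tuple) else verdict
        if ok:
            out.append(call.tool)
    return out

if __name__ == "__main__":
    print("classifier only:", allowed_tools(classifier_only_gate, EVADED))
    print("layered:        ", allowed_tools(layered_gate, EVADED))
```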
Status
| Item | Detail |
|---|---|
| Paper | “Jailbroken Frontier Models Retain Their Capabilities” |
| arXiv ID | 2605.00267 (v1 30 Apr 2026, v2 4 May 2026) |
| Scope | 28 jailbreaks, 5 benchmarks, Claude Haiku 4.5 → Opus 4.6 |
| Jailbreak tax (Haiku 4.5) | 33.1% average performance loss |
| Jailbreak tax (Opus 4.6, max thinking) | 7.7% average performance loss |
| Task sensitivity | Reasoning-heavy tasks degrade more than knowledge-recall |
| Boundary Point Jailbreaking | Near-perfect classifier evasion, near-zero degradation |
| Core recommendation | Safety cases must not rely on capability degradation from jailbreaks |
| Nature | Defensive research — no exploit payloads |