DEFENSE LOW NEW

Stop scoring jailbreak defenses on attack success rate alone

A May 2026 IEEE S&P paper argues that attack success rate — the field's default metric — hides how jailbreak defenses actually behave. Its Security Cube evaluates them across several axes at once.

2026-06-02 // 6 min // affects: jailbreak-defenses, safety-guardrails, safety-classifiers, aligned-llms

What is this?

A jailbreak defense that “blocks 95% of attacks” tells you almost nothing useful on its own. That figure is just the complement of attack success rate (ASR), the metric most jailbreak papers report, and a new systematization argues it is the wrong lens for deciding which defense to deploy.

On May 6, 2026, Feiyue Xu, Hongsheng Hu, Chaoxiang He, Bin Benjamin Zhu, Dawu Gu, Shuo Wang and co-authors posted SoK: Robustness in Large Language Models against Jailbreak Attacks (arXiv:2605.05058), to appear at the 47th IEEE Symposium on Security and Privacy (May 18–20, 2026). The paper builds a taxonomy of jailbreak attacks and defenses and introduces Security Cube, a multi-dimensional evaluation framework. Its core claim for practitioners: evaluation practice in this field is inadequate because it leans on narrow metrics that “fail to capture the multidimensional nature of LLM security.”

How does the Security Cube work?

The Security Cube reframes a jailbreak defense as something you measure on several axes simultaneously, not a single pass/fail score. The paper pairs the taxonomy with benchmark studies on 13 representative attacks and 5 defenses, plus automated judges, to map the current landscape.

The shift matters because ASR collapses several independent questions into one figure:

  • Robustness across attack families. A defense tuned against one attack type (say, role-play prompts) can score well on average while staying wide open to another (encoding tricks, multi-turn escalation). One blended ASR hides that gap.
  • Utility cost. A guardrail that refuses borderline-but-benign requests can post a very low ASR while quietly degrading the product. Safety and helpfulness trade against each other, and ASR shows only the safety side.
  • Judge dependence. “Success” is decided by an automated judge, and judges disagree. An ASR is only as trustworthy as the judge that produced it — a point the paper studies directly.
  • Cost and latency. Two defenses with identical ASR can differ by orders of magnitude in compute. The expensive one may be unshippable.
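
The first point is easy to see with a few lines of arithmetic. The sketch below is illustrative only: the family names, the counts, and the helper functions `blended_asr` and `asr_by_family` are assumptions made up for this example, not data or code from the paper.

```python
from collections import defaultdict

# Hypothetical evaluation log of (attack_family, attack_succeeded) pairs.
# Family names and counts are invented for illustration; they are not
# results from the paper.
RESULTS = (
    [("role_play", True)] * 2 + [("role_play", False)] * 78
    + [("encoding", True)] * 1 + [("encoding", False)] * 9
    + [("multi_turn", True)] * 7 + [("multi_turn", False)] * 3
)


def blended_asr(results):
    """Attack success rate over every attempt, ignoring attack family."""
    return sum(int(succeeded) for _, succeeded in results) / len(results)


def asr_by_family(results):
    """Attack success rate computed separately for each attack family."""
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for family, succeeded in results:
        attempts[family] += 1
        successes[family] += int(succeeded)
    return {family: successes[family] / attempts[family] for family in attempts}


overall = blended_asr(RESULTS)
per_family = asr_by_family(RESULTS)
worst_family = max(per_family, key=per_family.get)

print(f"blended ASR: {overall:.1%}")
for family, rate in sorted(per_family.items()):
    print(f"  {family:<10} {rate:.1%}")
print(f"worst family: {worst_family}")
```

With these made-up counts the blended ASR is 10%, while the multi-turn family on its own sits at 70%. The average looks healthy only because the test set is dominated by the attack type the defense already handles.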

This echoes earlier systematic work. The PandaGuard framework (Shen et al., arXiv:2505.13862) modeled jailbreak safety as attacker–defender–judge interactions across 49 models and reported the same lesson from the benchmark side: defense cost–performance trade-offs and judge consistency only appear when you stop reporting one aggregate number.

Why it matters

For anyone choosing a jailbreak defense, a vendor’s headline ASR is not enough to base a purchasing decision on. Two products advertising “99% block rate” can behave completely differently once you ask which attacks, at what utility cost, judged how, and at what latency. The Security Cube is essentially a checklist for asking those questions in a structured way.

It also explains a recurring disappointment: a defense that benchmarks brilliantly often underperforms in production. Frequently the published ASR was measured against a fixed attack set, while real adversaries adapt — a dynamic our coverage of multi-turn jailbreak benchmarks and the output-filtering study has repeatedly shown.

Defenses

You can apply the paper’s framing without reading all of it. When evaluating a jailbreak defense:

  1. Demand per-attack-family results, not an average. Ask for breakdowns across distinct attack classes such as encoding, persuasion, many-shot, mathematical encoding, and multi-turn. An average ASR can mask a total failure on one family.
  2. Measure the utility tax. Run the defense against a benign workload and track over-refusals and quality loss. A defense you have to disable in production protects nothing.
  3. Pin down the judge. Know what decided “success.” Re-score a sample with a second judge; if the verdict moves, the headline number is soft.
  4. Test against an adaptive attacker. ASR measured on a static prompt set is a lower bound on what an attacker can achieve, not a guarantee. Defenses that hold under fixed prompts often fall to an attacker who optimizes against them.
  5. Budget cost and latency as first-class axes. Treat compute and added latency as part of the evaluation, not an afterthought.
  6. Layer defenses; don’t bet on one number. Combine input checks, representation-based detection and independent output filtering so no single metric is your whole safety story.
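
Steps 2 and 3 can be turned into numbers with very little code. The sketch below is one possible way to do it, not the paper's method: it uses Cohen's kappa as the agreement measure between two judges (a choice made here for illustration), and the labels, the refusal flags, and the function names `cohen_kappa` and `over_refusal_rate` are all hypothetical.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two binary judges scoring the same responses.

    Each label is 1 if the judge ruled the attack successful, else 0.
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("both judges must score the same non-empty sample")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a = sum(labels_a) / n  # share of responses judge A marks as success
    p_b = sum(labels_b) / n  # share of responses judge B marks as success
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    if expected == 1.0:
        return 1.0  # both judges gave the same constant verdict
    return (observed - expected) / (1 - expected)


def over_refusal_rate(refused_flags):
    """Fraction of benign prompts that the defended system refused."""
    return sum(refused_flags) / len(refused_flags)


# Hypothetical verdicts from two judges on the same ten attack responses.
judge_a = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
judge_b = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

# Hypothetical refusal flags for twenty benign prompts (True = refused).
benign_refused = [True] * 3 + [False] * 17

asr_a = sum(judge_a) / len(judge_a)
asr_b = sum(judge_b) / len(judge_b)
kappa = cohen_kappa(judge_a, judge_b)
utility_tax = over_refusal_rate(benign_refused)

print(f"ASR by judge A: {asr_a:.0%}, by judge B: {asr_b:.0%}")
print(f"judge agreement (Cohen's kappa): {kappa:.2f}")
print(f"over-refusal on benign prompts: {utility_tax:.0%}")
```

On this invented sample the first judge reports a 40% ASR and the second 20% for the same responses, which is the kind of movement that marks a headline number as soft. The over-refusal rate is reported next to the ASR so the utility cost stays visible instead of being folded into a single score.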

Status

Item | Reference | Date | Notes
SoK: Robustness against Jailbreak Attacks (Security Cube) | arXiv:2605.05058 | 2026-05-06 | To appear IEEE S&P 2026; 13 attacks / 5 defenses benchmarked; multi-axis evaluation
PandaGuard / PandaBench | arXiv:2505.13862 | 2025-05 | Attacker–defender–judge framework; 49 models; cost/judge trade-offs

The headline is not a new attack or a new defense. It is a measurement correction: a jailbreak defense is a multi-dimensional object, and any evaluation — or sales pitch — that reduces it to one percentage is hiding most of what you need to know.

Sources