ADVERSARIAL MEDIUM NEW

Rapid Poison: turning a jailbreak defense into an attack surface

A June 15, 2026 arXiv paper shows the proliferation step inside Rapid Response jailbreak defenses can be poisoned at a 1% rate — forcing up to 100% false positives or 96% false negatives in the guard classifier.

2026-06-19 // 7 min // affects: llama-guard-4, prompt-guard-2, safety-classifiers, rapid-response-pipelines

What is this?

On June 15, 2026, David Huang, Jaewon Chang, Avidan Shah, Prateek Mittal and Chawin Sitawarin posted “Rapid Poison: Practical Poisoning Attacks Against the Rapid Response Framework” (arXiv:2606.16242, cs.LG). It is not a new jailbreak. It is an attack on the defense that is supposed to stop jailbreaks.

The target is Rapid Response (RR), introduced by Peng et al. in November 2024. RR is an adaptive defense: when a novel jailbreak slips past a guard classifier, the attack is caught after the fact, a separate “proliferation” model paraphrases it into several synthetic variants, and the classifier is fine-tuned on those variants so it generalizes to the whole attack family. The original paper reported its strongest variant cutting in-distribution attack success by a factor of more than 240. RR-style proliferation is reportedly used in Anthropic’s ASL-3 deployment safeguards (May 2025), with similar agentic variants proposed by OpenAI. The new paper asks what happens when the attacker feeds that loop.
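The loop described above can be sketched in a few lines. This is a minimal structural sketch, not the Peng et al. implementation: `rapid_response_step`, `judge`, `proliferate`, `fine_tune`, and `n_variants` are hypothetical names, and the lambdas at the bottom are toy stand-ins for the real models.

```python
from typing import Callable, List


def rapid_response_step(
    reference: str,
    judge: Callable[[str], bool],
    proliferate: Callable[[str, int], List[str]],
    fine_tune: Callable[[List[str]], None],
    n_variants: int = 5,  # hypothetical default, not a value from the paper
) -> int:
    """Run one update for a jailbreak caught after the fact.

    Returns the number of synthetic samples added to training.
    """
    # Step 1: the defender validates that the reference is a real jailbreak.
    if not judge(reference):
        return 0
    # Step 2: a separate proliferation model paraphrases it into variants.
    variants = proliferate(reference, n_variants)
    # Step 3: the guard classifier is fine-tuned on the variants as "unsafe".
    fine_tune(variants)
    return len(variants)


# Toy stand-ins so the sketch runs end to end.
unsafe_training_pool: List[str] = []
added = rapid_response_step(
    reference="<caught jailbreak text>",
    judge=lambda text: True,
    proliferate=lambda text, n: [f"{text} (variant {i})" for i in range(n)],
    fine_tune=unsafe_training_pool.extend,
)
print(added, len(unsafe_training_pool))
```

The point of the sketch is the data flow: whatever passes the judge is multiplied by the proliferation model and lands in the training pool with an "unsafe" label, with no further human review in between.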

How it works

The insight is that proliferation is a double-edged sword: it up-samples a few rare real-world jailbreaks into many training samples, which also amplifies an attacker’s influence over the training set. The authors operate under a deliberately narrow threat model — the adversary may modify only jailbreak (positive-class) samples, never benign data and never labels. The poisoned reference must still read as genuinely harmful, or the defender’s own judge would reject it before proliferation.

To satisfy both constraints, the attack uses a conditional prompt injection that behaves one way when the proliferation model is generating “similar examples” and another way when the defender validates that the reference is a real jailbreak. The trigger keys on cues inherent to the proliferation task itself, which the paper argues cannot be stripped without changing how synthetic-data generation works. No payload or template is reproduced here; this is a summary of a published method.

That delivery mechanism enables two attacker objectives:

Objective            Failure induced        Mechanism (conceptual)
-------------------  ---------------------  ------------------------------------------
Targeted poisoning   False positives        Benign-looking inputs with a chosen
                     (utility damage)       feature (a format, subject, or entity
                                            name) are added to training as "unsafe"
                                            -> classifier learns a spurious
                                            "feature => unsafe" shortcut
Concept backdoor     False negatives        "Omission Attack": a concept is removed
                     (safety bypass)        from unsafe training samples, so the model
                                            learns the concept's PRESENCE as a "safe"
                                            signal -> adding it to a jailbreak flips
                                            the verdict to "safe"

The Omission Attack is the novel part: because the chosen concept appears only in safe data and never in the structurally similar poisoned unsafe data, the classifier misassociates its presence with the safe label.
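The statistical effect behind that misassociation can be reproduced on a toy model. The sketch below is not the paper's attack and uses no real classifier: it trains a tiny naive Bayes scorer with add-one smoothing on made-up placeholder tokens, where the concept token `zebra` appears only in safe samples. All names (`train`, `unsafe_log_odds`, the sample strings) are illustrative assumptions.

```python
import math
from collections import Counter


def train(safe_docs, unsafe_docs):
    """Count tokens per class; returns a tuple used by unsafe_log_odds."""
    safe = Counter(tok for doc in safe_docs for tok in doc.split())
    unsafe = Counter(tok for doc in unsafe_docs for tok in doc.split())
    vocab_size = len(set(safe) | set(unsafe))
    log_prior = math.log(len(unsafe_docs) / len(safe_docs))
    return safe, unsafe, vocab_size, log_prior


def unsafe_log_odds(text, model):
    """Naive Bayes log-odds of 'unsafe' vs 'safe' (positive means unsafe)."""
    safe, unsafe, vocab_size, score = model
    n_safe, n_unsafe = sum(safe.values()), sum(unsafe.values())
    for tok in text.split():
        p_unsafe = (unsafe[tok] + 1) / (n_unsafe + vocab_size)
        p_safe = (safe[tok] + 1) / (n_safe + vocab_size)
        score += math.log(p_unsafe / p_safe)
    return score


# The concept "zebra" occurs in safe data but never in unsafe data.
safe_docs = [
    "please summarize zebra report",
    "translate zebra memo",
    "summarize zebra notes",
    "translate weekly report",
]
unsafe_docs = [
    "please bypass filter report",
    "bypass filter memo",
    "ignore rules notes",
    "ignore rules report",
]
model = train(safe_docs, unsafe_docs)

without_concept = unsafe_log_odds("bypass filter report", model)
with_concept = unsafe_log_odds("bypass filter report zebra", model)
print(without_concept > with_concept)
```

Appending the concept token lowers the unsafe score even though the rest of the query is unchanged. In this toy the drop is not large enough to cross the decision threshold on its own; the paper's results concern fine-tuned neural classifiers, where the reported effect is much stronger.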

Why it matters

The reported numbers are striking for the access required. At a 1% poisoning rate — roughly 18 poisoned references in a 6,000-sample training set — the attacks reach up to 100% false positive rates and up to 96% false negative rates. Format-targeted poisoning hit 100% FPR on MCQ and JSON inputs; entity-targeted poisoning flagged ~95–98% of benign queries mentioning a chosen product while leaving similar entities mostly untouched; concept backdoors reached 96% FNR on harmful queries and transferred to triggers never seen in training. Tests used Llama Guard 4 (12B) and Prompt Guard 2 (86M) as victims, with Gemini 2.5/3 as the proliferation model (the authors note GPT and Claude refuse the proliferation task).
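The two figures quoted above can be checked against each other. Counted as raw references, 18 out of 6,000 is 0.3%, so the 1% rate presumably counts the samples those references yield after proliferation. That reading is an assumption, not something this summary states, and `poisoned_share` is a hypothetical helper written only for this check.

```python
def poisoned_share(poisoned_refs: int, samples_per_ref: float, total_samples: int) -> float:
    """Fraction of the training set descended from attacker-controlled references."""
    return poisoned_refs * samples_per_ref / total_samples


REFS, TOTAL, RATE = 18, 6000, 0.01  # figures quoted in the article

# If each poisoned reference counted as exactly one training sample:
share_if_counted_as_refs = poisoned_share(REFS, 1, TOTAL)

# Samples each reference would need to contribute to reach the quoted rate
# (assumption: the rate is measured over proliferated samples).
samples_per_ref_for_rate = RATE * TOTAL / REFS

print(f"{share_if_counted_as_refs:.2%}")  # 0.30%
print(f"{samples_per_ref_for_rate:.2f}")  # 3.33
```

Either way, the arithmetic illustrates the amplification argument: a handful of references turns into a larger slice of the training set once each one is multiplied.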

The broader lesson is the uncomfortable one: a sample-efficient defense that learns from in-the-wild data inherits the trust problems of that data. The authors frame a three-way dilemma — RR cannot at once offer fast adaptation, utility-preserving generalization, and robustness to training-data manipulation.

Defenses

The paper evaluates two mitigations, the first two items below, and is candid that neither is a complete fix. The remaining two items are operational takeaways.

Inspect references before proliferation. A guardrail LLM screening incoming references (a PromptArmor-style filter) caught most but not all poisoning: it missed about 10% of poisoned references in aggregate, and more on harder templates. Clean and poisoned jailbreaks look alike, which makes calibration hard. Treat it as defense-in-depth, not a gate.
Use a prompt-injection-resistant proliferation model. Swapping in an injection-hardened model (Meta SecAlign 70B) drove the targeted false-positive rate from 98% to 0% against the paper’s static template. The authors caution that this result holds only for a fixed template and is therefore a lower bound on attack success: an adaptive attacker could do considerably better.
Treat the safety pipeline as attackable. The structural takeaway is that proliferation-based loops must be hardened before deployment. Separate trust domains between “data that becomes training labels” and “data submitted by untrusted parties,” cap any single reference’s amplification, and monitor the classifier for sudden distribution shifts in what it flags.
Watch for the failure signatures. Spikes in benign refusals tied to a specific format, subject, or entity — or quiet drops in detection when an unusual concept is present — are consistent with this class of poisoning and worth alerting on.
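The false-positive signature in the last item lends itself to a simple monitor. The sketch below assumes traffic has already been tagged upstream with a feature label (format, subject, or entity) and is believed benign. `flag_rate_by_feature`, `spikes`, the `min_jump` threshold, and the baseline numbers are all hypothetical choices for illustration, not values from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def flag_rate_by_feature(events: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """events: (feature_label, was_flagged) pairs for traffic believed benign."""
    totals: Dict[str, int] = defaultdict(int)
    flagged: Dict[str, int] = defaultdict(int)
    for feature, was_flagged in events:
        totals[feature] += 1
        flagged[feature] += int(was_flagged)
    return {feature: flagged[feature] / totals[feature] for feature in totals}


def spikes(baseline: Dict[str, float], current: Dict[str, float],
           min_jump: float = 0.2) -> List[str]:
    """Features whose flag rate rose by at least min_jump over the baseline."""
    return sorted(
        feature for feature, rate in current.items()
        if rate - baseline.get(feature, 0.0) >= min_jump
    )


# Hypothetical rates before and after a classifier update.
baseline = {"json": 0.02, "prose": 0.02}
events = [("json", True)] * 9 + [("json", False)] + [("prose", False)] * 10
current = flag_rate_by_feature(events)

print(spikes(baseline, current))  # ['json']
```

A per-feature comparison matters here because targeted poisoning can leave the aggregate flag rate almost unchanged while one format or entity is refused nearly every time. The mirror-image check for false negatives would need labeled harmful probes, which this sketch does not cover.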

Status

Item                Detail
------------------  ----------------------------------------------------------
Paper               arXiv:2606.16242, posted 2026-06-15
Target defense      Rapid Response proliferation pipeline (Peng et al., 2024)
Threat model        Attacker modifies only jailbreak samples; no label or
                    benign-data control
Poisoning rate      ~1% of training set (≈18 references)
Reported impact     Up to 100% false positives; up to 96% false negatives
Classifiers tested  Llama Guard 4 (12B), Prompt Guard 2 (86M)
Disclosure          Authors state they disclosed to potentially affected
                    parties and withheld an operational recipe

The point is not that Rapid Response is broken. It is that a defense which trains on untrusted, in-the-wild data is itself a target — and that any adaptive safety mechanism should be red-teamed as an attack surface before it ships, not after.