ADVERSARIAL MEDIUM NEW

SlotGCG: adversarial token position, not just content, drives jailbreaks

A June 2026 paper reports that GCG-style jailbreaks gain about 14% in attack success rate when adversarial tokens are placed at attention-correlated slots inside the prompt, and retain 42% higher success under input filtering.

2026-06-08 // 6 min affects: open-weight-llms

What is this?

SlotGCG is an optimization-based jailbreak technique published on arXiv in June 2026 (2606.05609) by researchers at Dongguk University, Seoul. It revisits a long-standing assumption in adversarial-suffix attacks: that the suffix — the end of the prompt — is the best place to put optimized adversarial tokens.

The reference attack here is GCG (Greedy Coordinate Gradient, Zou et al., 2023), which appends a string of gradient-optimized tokens to a harmful request to push an aligned model toward complying. GCG variants since have generally kept the tokens at the tail. SlotGCG’s finding is simple and uncomfortable: where you insert the adversarial tokens matters as much as what they are, and the suffix is frequently not the most vulnerable position.

How it works

The paper generalizes the insertion point into the notion of a slot. For a prompt of L tokens, there are L+1 candidate slots: one before the first token, one between every pair of adjacent tokens, and one after the last token. GCG only ever uses the final slot.
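To make the slot count concrete, here is a minimal sketch of slot indexing on a tokenized prompt. The function names `enumerate_slots` and `insert_at_slot` and the `<X>` marker are illustrative assumptions, not identifiers from the paper, and the inserted token is a harmless placeholder.

```python
from typing import List, Sequence


def enumerate_slots(prompt_tokens: Sequence[str]) -> List[int]:
    """Return slot indices 0..L for a prompt of L tokens.

    Slot i means "insert before token i"; slot L is the classic suffix position.
    """
    return list(range(len(prompt_tokens) + 1))


def insert_at_slot(
    prompt_tokens: Sequence[str], slot: int, inserted: Sequence[str]
) -> List[str]:
    """Return a new token list with `inserted` placed at the given slot."""
    if not 0 <= slot <= len(prompt_tokens):
        raise ValueError(f"slot must be in [0, {len(prompt_tokens)}], got {slot}")
    return list(prompt_tokens[:slot]) + list(inserted) + list(prompt_tokens[slot:])


# A benign 4-token prompt has 5 candidate slots.
tokens = ["Summarize", "this", "report", "please"]
slots = enumerate_slots(tokens)

# Suffix-only placement (what classic GCG uses) versus a mid-prompt slot.
suffix_only = insert_at_slot(tokens, slots[-1], ["<X>"])
mid_prompt = insert_at_slot(tokens, 2, ["<X>"])
```

The same helper is enough for a test harness to place a benign marker at every slot and check whether an input filter inspects each position.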

SlotGCG instead scores all slots with a Vulnerable Slot Score (VSS), a metric that estimates how susceptible each position is to adversarial insertion, then concentrates the optimization on the highest-scoring slots. The procedure is attack-agnostic: it is a position-search front-end that the authors say can be bolted onto any optimization-based attack, adding only about 200 ms of preprocessing.
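The VSS formula itself is not reproduced in this write-up, so the sketch below only illustrates the selection step: given one score per slot, keep the k highest. The helper `select_top_slots` and the score values are hypothetical; real scores would have to come from the paper's metric.

```python
from typing import List, Sequence


def select_top_slots(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring slots, in prompt order."""
    if k < 0:
        raise ValueError("k must be non-negative")
    # Rank slot indices by score, highest first (ties keep the earlier slot).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical per-slot scores for a 4-token prompt (5 slots); NOT real VSS values.
example_scores = [0.10, 0.70, 0.20, 0.65, 0.30]
chosen = select_top_slots(example_scores, k=2)
```

With these made-up numbers the suffix slot (index 4) is not selected, which is the situation the paper describes as common.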

No payloads are reproduced here — the canonical reference is the paper itself. The conceptual shape is what matters:

Classic GCG:   [ harmful request ] [ optimized suffix ]
                                    └── only ever here

SlotGCG:       [ ... ] [REDACTED] [ ... ] [REDACTED] [ ... ]
                       └── inserted at the highest-VSS slots,
                           which are usually NOT the suffix

Two results from the exploratory study are the actual story:

Vulnerable slots track the model’s attention. The positions that are easiest to attack correlate strongly with the model’s attention pattern over the input. They stay vulnerable even when the inserted tokens change — which means the weakness is a property of the position, not of a specific magic string. Each prompt, the authors argue, inherently contains its own vulnerable slots.
The gains are measurable. Averaged across the GCG-based methods and models the authors tested, attacks that target high-VSS slots show about a 14% increase in attack success rate, converge in fewer optimization steps, and (critically for defenders) retain 42% higher success under input-filtering defenses.

Why it matters

The headline is not “a new jailbreak.” GCG has been public since 2023. The headline is that a whole class of defenses was implicitly tuned for the wrong place.

Many practical guardrails assume adversarial noise lives at the end of the prompt: perplexity checks weighted toward the tail, suffix stripping, “trim anything after the user’s question.” SlotGCG distributes perturbations across attention-correlated slots throughout the prompt, which is a likely reason it retains 42% higher success under input filtering than suffix-only placement. If your input-side defense was validated against vanilla GCG, that validation may not transfer.

The attention correlation also matters for detection research. It suggests the vulnerability is structural — tied to how the transformer weights its input — rather than a quirk of any one optimized suffix. That is good news for principled defenses (there is a signal to monitor) and bad news for string-matching ones (there is no fixed string to block).

Scope check: GCG and SlotGCG are white-box attacks that need gradient access, so the direct target is open-weight models you host or fine-tune yourself. The original GCG work showed that optimized suffixes can transfer to closed models, but SlotGCG’s position search is a white-box procedure. Treat it primarily as a sharper red-teaming tool against models you run, and as evidence that alignment alone is not a deployment control.

Defenses

Stop defending only the suffix. Apply perplexity and anomaly checks across the whole sequence, with a sliding window, not just the tail. Suffix-focused filters miss mid-prompt perturbations, which is consistent with SlotGCG’s 42% higher success under input filtering.
Use input transformation, not just detection. Paraphrasing and retokenization (Jain et al., 2023) break the brittle, position-dependent token arrangements these attacks rely on, because they move or rewrite the very slots being targeted. They can degrade output quality, so reserve them for high-risk paths.
Watch attention, not strings. Because vulnerable slots correlate with attention concentration, attention-pattern anomaly detection is a more durable signal than blocklisting suffixes. This is research-grade, but it is the direction the finding points to.
Layer the defense. Pair input-side measures with output-side refusal/safety classifiers and tool-call gating, so a jailbroken generation still has to clear a second check before it produces harm or triggers an action.
Gate open-weight and fine-tuned deployments. White-box gradient access is the precondition for this attack. Models you self-host are the realistic target — put them behind runtime guardrails and monitoring rather than trusting their built-in alignment.
Re-test your guardrails with position-varied attacks. If your red-team harness only runs suffix-based GCG, add slot-varied insertion to it. A guardrail that passes against vanilla GCG can still fail here.
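As one possible shape for the whole-sequence check in the first item above, the sketch below computes perplexity over sliding windows from per-token log-probabilities. It assumes you already have natural-log token probabilities from some scoring model; the function names, window size, and threshold are illustrative choices, not values from the paper or from Jain et al.

```python
import math
from typing import List, Sequence, Tuple


def windowed_perplexity(
    token_logprobs: Sequence[float], window: int, stride: int = 1
) -> List[Tuple[int, float]]:
    """Return (start_index, perplexity) for each window over the whole sequence."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    n = len(token_logprobs)
    if n == 0:
        return []
    window = min(window, n)  # short inputs are scored as a single window
    results = []
    for start in range(0, n - window + 1, stride):
        chunk = token_logprobs[start:start + window]
        # Perplexity is exp of the negative mean log-probability.
        results.append((start, math.exp(-sum(chunk) / len(chunk))))
    return results


def flag_windows(
    token_logprobs: Sequence[float], window: int, threshold: float, stride: int = 1
) -> List[int]:
    """Return start indices of windows whose perplexity exceeds the threshold."""
    return [
        start
        for start, ppl in windowed_perplexity(token_logprobs, window, stride)
        if ppl > threshold
    ]


# Synthetic log-probs: fluent text with a low-probability run in the middle.
logprobs = [-1.0] * 6 + [-9.0] * 4 + [-1.0] * 6

# Whole-sequence scan catches the mid-prompt run.
flagged = flag_windows(logprobs, window=4, threshold=1000.0)

# A tail-only check on the last 4 tokens sees nothing unusual.
tail_only = flag_windows(logprobs[-4:], window=4, threshold=1000.0)
```

On this synthetic input the full scan flags the windows overlapping the middle run while the tail-only check flags none. A stride of 1 is used because larger strides can skip the final tokens; thresholds need calibration on benign traffic before use.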

Status

Item	Reference	Date	Notes
SlotGCG	arXiv 2606.05609	2026-06	Position-search front-end; VSS metric; +14% ASR, +42% ASR under input filtering
GCG (baseline)	arXiv 2307.15043	2023-07	Suffix-only adversarial optimization; the assumption SlotGCG breaks
Baseline defenses	arXiv 2309.00614	2023-09	Perplexity detection, paraphrase, retokenization, adversarial training

The takeaway for defenders: an input filter is only as good as the positions it inspects. SlotGCG is a reminder that “the attack is at the end of the prompt” was always an assumption, and assumptions are where guardrails quietly fail.