JAILBREAK
(12)12 hack(s).
CTF-framing jailbreaks: the prompt leaks into the attack
Sysdig (June 15, 2026) caught operators jailbreaking their own coding assistants by framing exploit requests as CTF or CVE-hunting — and the framing bleeds into User-Agents, passwords and IAM logs, leaving a cheap defender fingerprint.
Cognitive overload: how low image resolution jailbreaks multimodal LLMs
A May 2026 paper (Findings of ACL 2026) shows that lowering the resolution of text rendered as an image pushes frontier MLLMs into an 'Attack Comfort Zone' where safety alignment collapses while OCR stays accurate.
RL jailbreaking: reward shape and episode length drive the attack
A June 2026 study deconstructs reinforcement-learning jailbreaking and finds the attacker's environment design — dense rewards and long episodes — matters more than the RL algorithm.
UniAttack: one automated jailbreak that targets layered LLM defenses
A June 2026 preprint builds an automated, strategy-mixing red-teaming framework and runs it against models with different stacked defenses — finding that layering guardrails does not guarantee robustness.
Adaptive jailbreaks keep breaking LLM defenses: the evaluation gap
A June 2026 framework, UniAttack, composes reusable attack features into one-shot jailbreaks that transfer across models and defenses — a reminder that any defense tested only against static attacks gives false assurance.
IICL: pattern completion beats safety alignment with 10 examples
An April 2026 arXiv paper turns a model's own in-context learning against it: about ten abstract-operator examples make GPT-5.4 complete a harmful pattern its content filters never flag.
Para-jailbreaking: when 'safe completions' leak harm in the alternatives
An April 27, 2026 arXiv paper names a new failure mode of output-centric safety: a model can correctly refuse the direct question yet leak harmful content inside the 'safe alternative' it offers instead.
Multi-clip video jailbreaks: why video inputs break multimodal LLM safety
A June 2026 ACL paper shows the video channel is a weaker safety boundary than images: attack success climbs as a video is split into more diverse short clips.
CodeSpear: when grammar-constrained decoding becomes a jailbreak surface
A June 10, 2026 arXiv paper shows that the reliability feature forcing LLM code output to be syntactically valid can itself be turned into a jailbreak. Applying a benign code grammar can bypass refusals; the authors' CodeShield defense answers with honeypot code.
Sockpuppeting: a one-line prefill that jailbreaks 11 production LLMs
A line of code injected as the last assistant message coaxes 7 of 10 major models into harmful completions. The fix is not at the model — it is API-side message-order validation.
Mathematical encoding jailbreaks: when set theory bypasses LLM safety
An arXiv paper posted on May 5, 2026 shows that re-expressing a harmful prompt as a set-theory or formal-logic problem bypasses safety training on 46–56% of attempts across eight frontier models — but only when a helper LLM does the reformulation, not when mathematical syntax is bolted on top.
Many-shot jailbreaking: 256 examples to bypass any alignment
Anthropic researchers showed that stuffing the context window with 256 fake Q&A examples reliably bypasses safety training. Bigger context = bigger attack surface.