JAILBREAK


12 hacks.

CTF-framing jailbreaks: the prompt leaks into the attack

Sysdig (June 15, 2026) caught operators jailbreaking their own coding assistants by framing exploit requests as CTF or CVE-hunting — and the framing bleeds into User-Agents, passwords and IAM logs, leaving a cheap defender fingerprint.

2026-06-21//7 min

JAILBREAK MEDIUM NEW

Cognitive overload: how low image resolution jailbreaks multimodal LLMs

A May 2026 paper (Findings of ACL 2026) shows that lowering the resolution of text rendered as an image pushes frontier MLLMs into an 'Attack Comfort Zone' where safety alignment collapses while OCR stays accurate.

2026-06-21//6 min

JAILBREAK MEDIUM NEW

RL jailbreaking: reward shape and episode length drive the attack

A June 2026 study deconstructs reinforcement-learning jailbreaking and finds the attacker's environment design — dense rewards and long episodes — matters more than the RL algorithm.

2026-06-20//6 min

JAILBREAK MEDIUM NEW

UniAttack: one automated jailbreak that targets layered LLM defenses

A June 2026 preprint builds an automated, strategy-mixing red-teaming framework and runs it against models with different stacked defenses — finding that layering guardrails does not guarantee robustness.

2026-06-20//5 min

JAILBREAK MEDIUM NEW

Adaptive jailbreaks keep breaking LLM defenses: the evaluation gap

A June 2026 framework, UniAttack, composes reusable attack features into one-shot jailbreaks that transfer across models and defenses — a reminder that any defense tested only against static attacks gives false assurance.

2026-06-18//6 min

JAILBREAK MEDIUM

IICL: pattern completion beats safety alignment with 10 examples

An April 2026 arXiv paper turns a model's own in-context learning against it: about ten abstract-operator examples make GPT-5.4 complete a harmful pattern its content filters never flag.

2026-06-17//6 min

JAILBREAK MEDIUM NEW

Para-jailbreaking: when 'safe completions' leak harm in the alternatives

An April 27, 2026 arXiv paper names a new failure mode of output-centric safety: a model can correctly refuse the direct question yet leak harmful content inside the 'safe alternative' it offers instead.

2026-06-16//6 min

JAILBREAK MEDIUM NEW

Multi-clip video jailbreaks: why video inputs break multimodal LLM safety

A June 2026 ACL paper shows the video channel is a weaker safety boundary than images: attack success climbs as a video is split into more diverse short clips.

2026-06-14//6 min

JAILBREAK MEDIUM NEW

CodeSpear: when grammar-constrained decoding becomes a jailbreak surface

A June 10, 2026 arXiv paper shows that the reliability feature forcing LLM code output to be syntactically valid can itself be turned into a jailbreak. Applying a benign code grammar can bypass refusals; the authors' CodeShield defense answers with honeypot code.

2026-06-12//6 min

JAILBREAK MEDIUM NEW

Sockpuppeting: a one-line prefill that jailbreaks 11 production LLMs

A line of code injected as the last assistant message coaxes 7 of 10 major models into harmful completions. The fix is not at the model — it is API-side message-order validation.

2026-05-28//7 min

JAILBREAK MEDIUM

Mathematical encoding jailbreaks: when set theory bypasses LLM safety

An arXiv paper posted on May 5, 2026 shows that re-expressing a harmful prompt as a set-theory or formal-logic problem bypasses safety training on 46–56% of attempts across eight frontier models — but only when a helper LLM does the reformulation, not when mathematical syntax is bolted on top.

2026-05-25//7 min

JAILBREAK CRITICAL

Many-shot jailbreaking: 256 examples to bypass any alignment

Anthropic researchers showed that stuffing the context window with 256 fake Q&A examples reliably bypasses safety training. Bigger context = bigger attack surface.

2026-05-15//6 min