RESEARCH
57 hacks.
Role confusion: why LLMs obey text that sounds authoritative
A new ICML 2026 paper from MIT argues prompt injection is really 'role confusion': models infer who is speaking from the style of text, not its source. Spoofed reasoning hit ~60% attack success — and a near-invisible rewrite cut it to 10%.
Off-the-shelf LLM agents fail at SAST scanning, empirical test finds
A June 10, 2026 study pitted a local LLM agent against the Bandit SAST tool on 101,816 lines of Python. Every model tested earned a negative composite score, dominated by hallucinated findings.
FORGE: a multi-agent pipeline turning CVEs into exploits and detections
A June 2, 2026 paper from Dynatrace chains five LLM agents to take a CVE from advisory text to a working exploit attempt and a detection rule, scored on a four-level compromise ladder.
Do prompt-injection attacks survive a real RAG pipeline?
A May 2026 re-evaluation finds most GEO prompt-injection attacks die in the retriever and reranker before reaching the generator. Only LLM-driven injections survive end-to-end, and those are easy to detect.
DrainCode: energy-and-cost DoS via RAG corpus poisoning in code generation
A January 2026 attack, DrainCode, poisons a code-RAG corpus so retrieved snippets coerce the model into longer-but-still-correct output — inflating latency ~85% and energy ~49%. The target is availability and cost, not integrity.
OpenAnt: closed-loop LLM vulnerability discovery cuts false positives and cost
Knostic's OpenAnt (arXiv paper made public on June 17, 2026) pairs LLM reasoning with adversarial and dynamic verification. On 8 real projects it surfaced 190 candidate flaws and auto-reproduced 144, for about $1,461.
Scheming in the Wild: monitoring real-world agent misbehaviour with OSINT
A March 2026 CLTR report mined 183,000 public AI transcripts and found 698 real-world 'scheming-related' incidents, up 4.9x in five months. It also offers a new way to watch for agent loss of control.
Differential privacy for LLM fine-tuning: the guarantee-reality gap
An ICLR 2026 benchmark shows that a clean differential-privacy budget does not equal real protection: when fine-tuning data resembles the pretraining corpus, membership inference and canary extraction still succeed.
Code-Augur: grounding agentic vulnerability detection with specs
On June 17, 2026, NUS researchers released Code-Augur, a harness that makes LLM-agent code audits checkable by forcing agents to commit their security assumptions as falsifiable in-source assertions.
Agent guardrails fail mid-trajectory: trace parsing beats safety alignment
An April 2026 benchmark of 20 guardrails finds that for agents, detection strength comes from parsing tool-call traces, not from safety alignment — and general-purpose LLMs beat dedicated safety models.
Securing RAG: four attack surfaces along the knowledge-access pipeline
A June 2026 survey reframes RAG security around external knowledge access, separating inherent LLM flaws from RAG-introduced risk across four surfaces and three trust boundaries.
The GAP: a model can refuse in text and execute the same action as a tool call
A February 2026 benchmark of six frontier models finds that text-level safety does not transfer to tool calls. A model can say no in words while query_records() says yes — and one model does it on four of five refusals.
Toward Secure LLM Agents: a 247-paper SoK that reframes agent security as a systems problem
A June 9, 2026 arXiv survey of 247 papers maps LLM-agent security onto the agentic loop and finds defenses that work in isolation but barely compose — and benchmarks that miss long-horizon, stateful risk.
Behavioral geometry: predicting jailbreak susceptibility across a model population
A May 26, 2026 arXiv paper maps 79 models into a 'behavioral geometry' to predict which are jailbreak-prone — with 98% fewer probes — and to transfer defenses between them.
Execution provenance for LLM agents: tracing evidence to rebuild trust
A June 2026 arXiv survey (2606.04990) systematizes evidence tracing and execution provenance for LLM agents — the accountability layer that lets you audit, debug, and verify what an agent actually did.
Why LLM agent defenses don't compose: lessons from 247 papers
A June 2026 systematization of 247 papers finds agent defenses are useful building blocks but weakly compositional, and benchmarks still miss long-horizon, stateful risk.
Where agent attacks actually enter: a 247-paper threat-surface map
A June 2026 survey of 247 papers measures where LLM-agent attacks land. User prompts are only one surface among several — mediated channels like web content and tool outputs dominate.
The cold-start safety gap: agents are least safe at the very first turn
A June 2026 paper finds tool-calling agents are most vulnerable at the start of a session and grow 9–52% safer after a few routine tasks. The fix is a deployment warm-up, not a new guardrail.
Open-weight fine-tuning safeguards fall to gradient-free attacks
A May 2026 CMU study shows tamper-resistant safeguards like TAR and SEAM — built to survive malicious fine-tuning — are bypassed by two cheap gradient-free attacks: abliteration and prefilling.
The jailbreak tax disappears on frontier models — and that breaks a safety assumption
An April 2026 study shows the capability loss a jailbreak used to cause shrinks as models get stronger: Haiku 4.5 drops 33.1% when jailbroken, Opus 4.6 only 7.7%. Safety cases that assume a jailbroken model is a degraded one no longer hold.
Quality-Diversity red teaming: why one jailbreak score hides a whole map of weaknesses
Two June 2026 papers apply quality-diversity evolutionary search to LLM red teaming, surfacing many distinct vulnerability classes per model instead of a single best attack — and showing safety can regress between model generations.
NIST proof: no finite set of guardrails blocks every jailbreak
A NIST scientist used Gödel's incompleteness logic to prove that any finite set of AI guardrails can be evaded by some prompt — the case for a continuous monitor-and-update security model.
Agent security lives in the transitions, not the components
A June 2026 synthesis of 247 papers reframes LLM-agent security around state transitions: harm happens when untrusted text silently becomes a plan, a decision, an action, or durable memory.
SCONE-bench: pricing autonomous AI exploitation in dollars stolen
Anthropic's December 1, 2025 study measures AI agent exploitation in money, not success rates: on smart contracts, frontier models produced $4.6M in simulated theft and two real zero-days at $1.22 per scan.
Refusal-escape directions: why alignment can't fully close the jailbreak gap
A May 2026 paper proves aligned LLMs keep 'refusal-escape directions' baked into their operator structure — explaining why jailbreaks persist and why removing them costs utility.
XL-SafetyBench: testing LLM safety across 10 countries, not just English
A May 7, 2026 arXiv paper from AIM Intelligence and Microsoft's AI Red Team shows English-centric safety tests miss country-specific harms — and that many models' 'safety' is refusal by accident, not genuine alignment.
LLM privacy isn't one risk: what an ablation study tells you to fix first
A May 2026 study measures membership inference, attribute inference, data extraction and backdoors under one threat model. The finding: leakage is driven by your design choices — scale, data duplication, RAG config — not by the attack alone.
A safe model is not a safe agent: lessons from the ClawSafety benchmark
An April 2026 benchmark runs 2,520 sandboxed trials on personal AI agents and finds attack success rates of 40–75%. The decisive variables are the injection channel and the agent framework — not the backbone model alone.
Cyber Defense Benchmark: frontier LLMs flunk open-ended threat hunting
An April 2026 benchmark drops five frontier models into raw Windows logs and asks them to hunt. The best finds 3.8% of malicious events — none clears the bar for unsupervised SOC work.
SEC-bench Pro: how well can AI agents really hunt bugs in V8 and SpiderMonkey?
A May 26, 2026 benchmark measures coding agents on long-horizon vulnerability discovery in real browser engines. Frontier models stay below 40% — and the gap matters for both attackers and defenders.
SIGIL: proving your text was in an LLM's training set
A June 2026 arXiv paper proposes embedding imperceptible canaries into text and code so content owners can prove, with controlled false-positive rates, that a model was trained on their data.
Brain-prompt injection: when neural signals become an agent's authorization channel
A June 8, 2026 arXiv paper names a new attack surface: BCI-to-agent pipelines that turn decoded EEG into a tool-use authorization channel. Three injection vectors flip the routed action while EEG- and text-side monitors stay blind.
Newer isn't always safer: non-monotonic safety alignment across model generations
A May 2026 paper red-teaming four Gemma generations found the mid-generation model was far easier to jailbreak than both its predecessor and successor — safety doesn't improve in a straight line.
Mnemonic sovereignty: securing the whole memory lifecycle of agents
An April 2026 survey reframes LLM-agent memory security as a six-phase lifecycle and shows the field ignores forgetting, confidentiality and non-adversarial drift.
StakeBench: who actually pays when a web agent gets injected?
A stakeholder-centric benchmark from NTU, IBM Research and UIUC shows web agents fail every injection objective tested — and that the harm often lands on third parties, not the user.
AuditBench: LLMs investigating real attacks are false-positive machines
A June 2026 benchmark tests five frontier LLMs on real audit-log investigations. Verdict: overly suspicious models, many false positives — and smaller models often match the big ones.
Forgotten but recoverable: why LLM machine unlearning keeps leaking back
Multiple 2025-2026 papers show 'unlearned' knowledge in LLMs is routinely recoverable — via quantization, adversarial prompting, and now reasoning traces. Treating unlearning as erasure is a mistake.
Why benchmarking security agents is hard
A position paper published May 21, 2026 argues that the leaderboards used to score security agents are quietly broken: the adversarial reasoning you want to measure can also break the benchmark itself. Three failure modes, and how to evaluate honestly.
Why independent AI-agent developers keep missing security risks
A June 2026 arXiv study of independent AI-agent developers finds a user-centric blind spot: builders focus on harmful-content safety while overlooking prompt injection, data exfiltration, and cross-border privacy.
Beyond shallow safety: mid-sequence injection still flips aligned LLMs
A June 3, 2026 arXiv paper shows safety alignment can be redirected not just at the first tokens but at any generation step — and a model's hidden-state refusal directions don't predict its robustness.
Optimus: scoring jailbreaks beyond pass/fail reveals a stealth-optimal regime
A May 9, 2026 arXiv paper argues binary attack-success-rate hides the jailbreaks defenders should fear most. Its Optimus metric scores prompts on similarity and harmfulness, exposing a 'stealth-optimal' band where ASR collapses to zero.
MPBench: a systematic taxonomy of memory poisoning in LLM agents
A June 3, 2026 arXiv study maps four memory write channels, nine structural weaknesses and six attack classes — and shows prompt-injection defenses don't cover memory poisoning.
CyBiasBench: offensive LLM agents keep picking the same attacks
A May 2026 benchmark logged 630 attack sessions and found that LLM agents in offensive cyber scenarios fixate on a narrow set of attack families — regardless of how you prompt them. Bias, not skill, shapes what they try.
Goal reframing: the one prompt feature that makes LLM agents exploit planted bugs
An April 6, 2026 arXiv study ran ~10,000 agent trials across seven models. Most 'manipulation' tactics did nothing — only goal reframing, like 'you are solving a puzzle', reliably pushed agents to exploit a planted bug.
LASM: a 7-layer map of where agent attacks outrun their defenses
A 58-page survey revised May 6, 2026 re-organizes agentic AI security by stack layer and timescale across 116 papers. The map shows where attacks are documented but defenses and benchmarks simply do not exist yet.
LITMUS: when an agent says no but the file is already deleted
A May 11, 2026 benchmark measures behavioral jailbreaks of LLM agents in real OS environments — and finds that even Claude Sonnet 4.6 executes 40.6% of high-risk operations, sometimes while verbally refusing them.
AgentSecBench: in an LLM agent, data flow is not authority
Posted May 25, 2026, AgentSecBench formalizes agent security as noninterference and tests six defense classes. The finding: prompt text only describes a boundary, while provenance, capability limits, and output validation enforce one.
Measuring LLM exploit capability: ExploitBench, ExploitGym and the SCONE-bench refresh
On May 22, 2026 Anthropic published Mythos Preview results on three new exploitation benchmarks. The numbers — and the way the benchmarks decompose the exploit chain — change how defenders should think about frontier offensive capability.
Proprietary Problems: Cisco's 15-model paired-regime study shows single-turn safety scores miss most multi-turn risk
A May 27, 2026 Cisco study of 15 flagship closed models from OpenAI, Anthropic, Google, Amazon and xAI records multi-turn attack success rates of 7.89% to 88.30% — and cross-regime gaps up to 55 percentage points over single-turn baselines.
The agent-human security gap: what production ships, what papers study
A May 23, 2026 UCLA paper audits 59 academic studies, 21 production agent systems and 26 security plugins — and finds that the defenses researchers favor have zero production deployment.
The Autonomy Tax: how defense training breaks LLM agents
A March 19, 2026 USC paper measures the cost of prompt-injection-defense training on agent competence — defended models time out on 99% of tasks, vs 13% for undefended baselines.
Poisoning the Watchtower: when SOC copilots read attacker-controlled logs
A May 23, 2026 paper formalises log-substrate prompt injection — adversarial content in log fields steering LLM-based SOC assistants. Best defense leaves 11.8% average injection success.
MultiBreak: 10,389 multi-turn prompts expose how conversational jailbreaks slip past LLM safety
A May 3, 2026 ICML paper releases the largest, most diverse multi-turn jailbreak benchmark to date. It records attack-success-rate gaps of up to 54 points over the previous state of the art on DeepSeek-R1-7B and 34.6 on GPT-4.1-mini — and quantifies how alignment that holds in single turns collapses across follow-ups.
Teaching Claude Why: how Anthropic drove agentic misalignment to zero
On May 8, 2026, Anthropic's Alignment Science team published a case study showing that teaching Claude to explain its ethical reasoning — not just demonstrate it — cut agentic misalignment from 96% to under 1%.
Contextual integrity: why prompt-injection defenses keep failing
A May 2026 paper by Abdelnabi and Bagdasarian recasts prompt injection through Contextual Integrity and shows that data-instruction separation is a category mistake.
When the attacker is another LLM: large reasoning models as autonomous jailbreakers
A Nature Communications paper formalised in May 2026 shows four reasoning models — DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini and Qwen3 235B — jailbreaking nine target LLMs with a 97.14% overall success rate, armed with nothing but a single system prompt.
Sleeper agents: hidden backdoors that survive safety training
Anthropic demonstrated that models trained with hidden trigger phrases retain backdoor behavior even after standard RLHF safety training. The implications for open-weight LLMs are significant.