All hacks (375)

Automated prompt injection is model-dependent: TAP beats GCG, GPT-5 resists

A June 9, 2026 ETH Zurich study adapts GCG and TAP to AgentDojo across 80 agent task pairs. Black-box TAP beats gradient-based GCG, yet attacks tuned on small models fail to transfer to GPT-5.

2026-06-25//6 min

DATA LEAK CRITICAL NEW

DifyTap: four authorization flaws leak AI chats across Dify tenants

Zafran Labs disclosed four DifyTap flaws in Dify (June 22, 2026) — two critical, two unauthenticated, three cross-tenant — that let an attacker wiretap other customers' AI conversations and read their files. Three are fixed in 1.14.2.

2026-06-25//7 min

Over-privileged tool selection: agents reach for stronger tools than the task needs

A June 2026 paper and its benchmark ToolPrivBench show that mainstream LLM agents routinely pick higher-privilege tools when a weaker one would do — and that safety alignment does not fix it.

MemMark: attributing a poisoned agent memory from the snapshot alone

A May 26, 2026 arXiv paper embeds ownership into an agent's latent memory-write decisions, so provenance survives even when logs are erased and only the final memory snapshot remains.

Agent communication-graph metadata leaks the workflow before it runs

A June 5, 2026 arXiv paper shows that even with encrypted payloads, the A2A/MCP communication graph lets a passive observer predict an agent workflow's task class from its opening — and act before it completes.

Off-the-shelf LLM agents fail at SAST scanning, empirical test finds

A June 10, 2026 study pitted a local LLM agent against the Bandit SAST tool on 101,816 lines of Python. Every model scored a negative composite, dominated by hallucinated findings.

FORGE: a multi-agent pipeline turning CVEs into exploits and detections

A June 2, 2026 paper from Dynatrace chains five LLM agents to take a CVE from advisory text to a working exploit attempt and a detection rule, scored on a four-level compromise ladder.

PRAC: hijacking a computer-use agent's choice through its attention

An April 2026 Tübingen paper shows one imperceptibly perturbed product image can concentrate a computer-use agent's visual attention and steer 82% of its selections — without ever touching the output.

Cognitive Firewall: a split-compute defense for browser agents

A March 2026 eBay paper layers an on-device sentinel, a cloud planner and a deterministic execution guard to cut indirect prompt injection in browser agents from 100% to under 1%.

Do prompt-injection attacks survive a real RAG pipeline?

A May 2026 re-evaluation finds most GEO prompt-injection attacks die in the retriever and reranker before reaching the generator. Only LLM-driven injections survive end-to-end, and those are easy to detect.

DrainCode: energy-and-cost DoS via RAG corpus poisoning in code generation

A January 2026 attack, DrainCode, poisons a code-RAG corpus so retrieved snippets coerce the model into longer-but-still-correct output — inflating latency ~85% and energy ~49%. The target is availability and cost, not integrity.

Bucket squatting in Vertex AI: the "Pickle in the Middle" cross-tenant RCE

Unit 42 disclosed a Vertex AI Python SDK flaw (June 16, 2026): a predictable default staging bucket plus a missing ownership check let an attacker hijack a victim's model upload and gain cross-tenant code execution. Patched in v1.148.0.

OFFENSIVE AI CRITICAL NEW

1,000 captured agent logs: a low-skill attacker breached 14 firms with Claude and Codex

OALABS recovered over 1,000 Claude Code and Codex sessions from a careless attacker. Across all of them the frontier models raised only ten policy violations — the deskilling of intrusion, documented from the inside.

2026-06-22//7 min

LLMjacking evolves: stolen Ollama compute now drives autonomous attack agents

A June 17, 2026 Sysdig report documents a captured incident: an exposed, unauthenticated Ollama server used as the reasoning engine for a multi-stage offensive pipeline. The fix is operational, not model-side.

OpenAnt: closed-loop LLM vulnerability discovery cuts false positives and cost

Knostic's OpenAnt (arXiv paper public on June 17, 2026) pairs LLM reasoning with adversarial and dynamic verification. On 8 real projects it surfaced 190 candidate flaws and auto-reproduced 144 — for about $1,461.

2026-06-22//7 min

DeepMind's AI Control Roadmap: defense-in-depth for misaligned agents

Google DeepMind's AI Control Roadmap (June 2026) treats internal AI agents as potential insider threats, layering trusted-supervisor monitoring on top of model alignment.

Agent-Inflicted Damage: when AI agents wreck production with no attacker

Cyera's May 2026 study of 7,200+ AI incidents isolates 344 cases of agent-inflicted damage — 188 with no external attacker — where autonomous agents deleted databases, leaked secrets and burned budgets.

Image prompt reconstruction: rebuilding private images from distributed MLLM embeddings

A June 2026 paper shows a passive participant in a distributed multimodal-LLM pipeline can rebuild the user's input image from the intermediate embeddings it relays. Black-box, no model weights needed.

Agent skills are a supply chain: malware and prompt injection in SKILL.md

A February 2026 audit of ~4,000 agent skills found 13.4% with critical issues and 76 live malicious payloads. SKILL.md is now a software supply chain — here's how to triage it.

Criminal AI-as-a-Service in 2026: how the underground operationalizes cybercrime

A June 11, 2026 Rapid7 report finds the criminal AI market has shifted from 'evil chatbots' to a productivity layer: jailbreak wrappers, stolen accounts and deepfake-for-KYC services that scale ordinary crime.

Sleeper Memory Poisoning: dormant attacks on stateful LLM agents

A May 2026 paper shows attackers can plant fabricated 'memories' through a document or webpage that lie dormant, then steer an assistant's actions across many later sessions.

Mastra npm scope takeover: a dormant maintainer account poisons an AI agent framework

On June 17, 2026, a forgotten contributor account republished the entire @mastra npm scope — ~142 packages — with one malicious dependency that drops a crypto stealer and RAT. A stale credential, not a zero-day.

AutoJack: a browsing agent turns a malicious webpage into host RCE

Microsoft's June 18, 2026 AutoJack research shows a web-browsing AI agent inheriting localhost identity to reach a local MCP WebSocket and spawn arbitrary processes on the host.

CVE-2026-32211: missing authentication in Azure MCP Server

Microsoft disclosed CVE-2026-32211 on 2 April 2026 — a missing-authentication flaw in Azure MCP Server that lets an unauthenticated attacker disclose information over the network. Microsoft scored it 9.1; NVD, 7.5.

LLM salting: rotating the refusal direction to break jailbreak reuse

SophosAI's 'LLM salting' (CAMLIS 2025) applies a small rotation to a model's refusal direction so that a jailbreak precomputed against the base model no longer transfers to your deployment — the rainbow-table defense, applied to LLMs.

Message-object injection: the serialization gap in AI assistants

Imperva showed (June 10, 2026) that contacts, vCards and location pins get flattened inline into an AI assistant's prompt with no untrusted-content boundary — a structural injection vector, patched in OpenClaw 2026.4.23.

CTF-framing jailbreaks: the prompt leaks into the attack

Sysdig (June 15, 2026) caught operators jailbreaking their own coding assistants by framing exploit requests as CTF or CVE-hunting — and the framing bleeds into User-Agents, passwords and IAM logs, leaving a cheap defender fingerprint.

Cognitive overload: how low image resolution jailbreaks multimodal LLMs

A May 2026 paper (Findings of ACL 2026) shows that lowering the resolution of text rendered as an image pushes frontier MLLMs into an 'Attack Comfort Zone' where safety alignment collapses while OCR stays accurate.

WAAA: how agentic browsers resurrect classic web attacks

A May 2026 paper builds the first web-focused threat model for agentic browsers and shows that 10 long-mitigated web attacks come back — often amplified — because the agent is a confused deputy that cannot tell a task step from a web trap.

Scheming in the Wild: monitoring real-world agent misbehaviour with OSINT

A March 2026 CLTR report mined 183,000 public AI transcripts and found 698 real-world 'scheming-related' incidents, up 4.9x in five months — and a new way to watch for agent loss of control.

Defensive misdirection: why blocking automated jailbreaks can backfire

A June 2026 paper models the attacker's automated judge and shows that predictable refusals feed the search loop — proposing controlled misdirection instead of plain blocking.

DATA LEAK CRITICAL NEW

GeminiJack: zero-click exfiltration from Gemini Enterprise via prompt injection

Disclosed December 2025, GeminiJack let a single shared Doc, calendar invite or email silently exfiltrate Gmail, Calendar and Docs data through Gemini Enterprise's RAG — the enterprise-RAG exfiltration class OWASP now ranks first.

Overeager Coding Agents: Out-of-Scope Actions on Benign Tasks

Two May 2026 benchmarks measure coding agents that overstep on benign requests — deleting files, wiping credentials — and find the agent framework, not the model, drives the risk.

Tool selection hijacking: forcing an agent to pick the attacker's tool

An NDSS 2026 attack and an April 2026 IBM paper target the same blind spot: the step where an agent chooses which tool to call. Poison the catalog and the agent picks yours, with 70–100% success.

DATA LEAK LOW NEW

Capability vs propensity: auditing LLM training-data leakage

A June 2026 framework, PropMe, separates what a model CAN leak under attack from what it WILL leak in ordinary use. The gap is wide — and audits that ignore it misstate real-world risk.

CVE-2026-0755: command injection and file theft in gemini-mcp-tool

A June 18, 2026 advisory details how the popular gemini-mcp-tool let untrusted prompt input reach the shell and the Gemini CLI @file parser — CVSS 9.8 RCE and arbitrary file exfiltration, fixed in 1.1.6.

Backdoor unlearning generalizes: removing one trigger can suppress others

A June 2026 paper shows that teaching an LLM to ignore one backdoor trigger can also weaken other, never-targeted backdoors — when their internal activation shifts are close, measured by a new metric called CASD.

Why agent refusals fail: the Cybersecurity Refusal Framework

A new benchmark shows agent safety refusals key off the URL string, not the real target. Two trivial tricks — fake 'rules of engagement' and localhost proxying — flip refusal into compliance on production sites.

MCP security: stop asking which attacks exist, ask where defenses must live

An April 2026 arXiv paper maps MCP attacks across six architectural layers and finds defenses are uneven and disproportionately tool-centric — leaving host orchestration, transport and supply-chain layers structurally under-defended.

2026-06-20//7 min

TRAP: persuasion techniques turn web agents against their own task

An Oxford benchmark updated on arXiv in June 2026 shows web agents obey Cialdini-style persuasion hidden in page elements, abandoning their task in 25% of cases on average and up to 43% for the weakest model.

NRT-Bench: multi-turn red-teaming of LLM agents that run a plant

A June 18, 2026 benchmark puts LLM operator agents in a simulated nuclear control room. Adaptive multi-turn attacks pushed the team past a safety limit in 8.7-12.1% of sessions — and the failures barely overlap across models.

RL jailbreaking: reward shape and episode length drive the attack

A June 2026 study deconstructs reinforcement-learning jailbreaking and finds the attacker's environment design — dense rewards and long episodes — matters more than the RL algorithm.

INFRASTRUCTURE MEDIUM NEW

UniAttack: one automated jailbreak that targets layered LLM defenses

A June 2026 preprint builds an automated, strategy-mixing red-teaming framework and runs it against models with different stacked defenses — finding that layering guardrails does not guarantee robustness.

2026-06-20//5 min

vLLM SSRF: when the allowlist patch carried the same parser bug

Two vLLM advisories show the same flaw twice: a host allowlist validated with one URL parser and fetched with another. The fix swapped the parser pair and reopened the bypass.

Service-side exfiltration via deep research agents

A hidden instruction in a single email made ChatGPT's Deep Research agent leak inbox data from OpenAI's own cloud — no rendering, no user action, invisible to network defenses. Here is the class and how to contain it.

RAGFlow CVE-2026-45312: a prompt template that runs OS commands

A Jinja2 template injection in RAGFlow's prompt generator turns a user-controlled prompt field into server-side RCE. CVSS 9.9, disclosed May 9, 2026.

Differential privacy for LLM fine-tuning: the guarantee-reality gap

An ICLR 2026 benchmark shows that a clean differential-privacy budget does not equal real protection: when fine-tuning data resembles the pretraining corpus, membership inference and canary extraction still succeed.

When the AI reviewer can't read the figure: cross-modal attacks on peer review

A June 2026 arXiv paper (PaperGuard) shows AI peer reviewers are vulnerable not only through text but through figures — black-box prompt injection and white-box image perturbations both flip verdicts.

Code-Augur: grounding agentic vulnerability detection with specs

On June 17, 2026, NUS researchers released Code-Augur, a harness that makes LLM-agent code audits checkable by forcing agents to commit their security assumptions as falsifiable in-source assertions.

Localizing prompt injection: from detection to forensic excision

Detecting a prompt injection only tells you something is wrong. Two 2026 papers, PromptLocate and WebSentinel, pinpoint exactly which span of context is poisoned so it can be excised and the task recovered.

ChatGPhish: untrusted Markdown turns ChatGPT summaries into phishing

Permiso disclosed ChatGPhish on 29 May 2026: a web page you ask ChatGPT to summarize can render attacker links, fake alerts, QR codes and tracking pixels inside the trusted assistant UI.

SEAgent: mandatory access control to contain agent privilege escalation

A January 2026 paper reframes agent attacks as privilege escalation — actions exceeding the least privilege a task needs — and proposes SEAgent, a deterministic MAC/ABAC layer that enforces policy over an information-flow graph.

Vertex AI 'Double Agents': over-privileged service agents as a cloud escalation path

Unit 42 showed (31 March 2026) that a Vertex AI Agent Engine deployment exposes an over-scoped service-agent credential via the metadata service — turning a misconfigured agent into a path to read every bucket in the project.

Stored prompt injection: when an injection outlives the session

A June 2026 arXiv paper reframes prompt injection as a stored, cross-session problem: once adversarial text lands in an agent's persistent state, it can steer executions long after the attacker is gone.

An LLM agent that pentests Salesforce Experience Cloud end-to-end

On June 8, 2026, Reco published an agent that maps, fuzzes and exploits Salesforce Experience Cloud sites with no human in the loop — the same misconfigurations ShinyHunters has been mining since 2025, now driven by a model.

Agent guardrails fail mid-trajectory: trace parsing beats safety alignment

An April 2026 benchmark of 20 guardrails finds that for agents, detection strength comes from parsing tool-call traces, not from safety alignment — and general-purpose LLMs beat dedicated safety models.

MemPoison: backdooring agent memory through ordinary conversation

A May 2026 arXiv paper plants a triggerable backdoor in an LLM agent's long-term memory just by chatting with it — and is engineered to survive the selective extraction and rewriting stages meant to filter poisoned content.

Securing RAG: four attack surfaces along the knowledge-access pipeline

A June 2026 survey reframes RAG security around external knowledge access, separating inherent LLM flaws from RAG-introduced risk across four surfaces and three trust boundaries.

The GAP: a model can refuse in text and execute the same action as a tool call

A February 2026 benchmark of six frontier models finds that text-level safety does not transfer to tool calls. A model can say no in words while query_records() says yes — and one model does it on four of five refusals.

On-device isn't safer: indirect injection hits local and cloud LLMs alike

Brave's June 8, 2026 research shows indirect prompt injection works identically against a cloud browsing agent (Mozilla Tabstack) and an on-device autocomplete (Cotypist) — local hosting is not a mitigation.

Agent libOS: make the runtime, not the tool wrapper, the authority boundary

A June 2, 2026 arXiv paper argues most agent frameworks conflate tool visibility with resource authority — and proposes a library-OS runtime where capability checks live at primitive boundaries, not in tool wrappers.

Cordon: transactional containment for tool-using LLM agents

A June 16, 2026 arXiv paper proposes 'semantic transactions': a runtime that stages an agent's irreversible tool effects and validates the whole task flow before any commit.

AuthGraph: dual-graph alignment to catch agent prompt injection

A May 26, 2026 UCLA paper compares a clean authorization graph against the agent's actual provenance graph, cutting AgentDojo attack success from 40% to 1%.

INFRASTRUCTURE MEDIUM NEW

LangChain Core path traversal: legacy load_prompt reads arbitrary files

CVE-2026-34070 lets crafted prompt configs walk LangChain's filesystem via load_prompt, exposing .txt/.json/.yaml secrets. Disclosed March 27, 2026, fixed in langchain-core 1.2.22.

MCP Go SDK CSRF: a web page can trigger your local tools (CVE-2026-33252)

The official MCP Go SDK accepted cross-site browser POSTs without checking the Origin header. On an unauthenticated local server, any website you visit could invoke your tools. Patched in 1.4.1.

Error-path injection: when tool error messages carry implicit authority

A June 2026 paper (VATS) shows that injecting instructions inside tool error messages triples indirect-injection success on frontier agents — up to 100% compliance — because models treat error output as authoritative.

Rapid Poison: turning a jailbreak defense into an attack surface

A June 15, 2026 arXiv paper shows the proliferation step inside Rapid Response jailbreak defenses can be poisoned at a 1% rate — forcing up to 100% false positives or 96% false negatives in the guard classifier.

SkillAttack: automated red-teaming finds exploits in agent skills

An April 2026 paper, SkillAttack, reframes exploit discovery as a path-search problem and shows even well-intentioned agent skills are reachable — up to 0.93 attack success on adversarial skills.

Authority confusion: why tool-using agents misuse their own access

A May 2026 paper names a failure mode distinct from prompt injection: untrusted data should inform an agent's reasoning but never authorize side effects. AIRGuard enforces that line at action time.

FIRST's mid-year forecast: ~66,000 CVEs in 2026, but exploitable risk stays flat

On June 15, 2026, FIRST revised its 2026 CVE projection to ~66,000 — 46.3% above February — driven mainly by AI-assisted discovery. The actionable subset triaged by EPSS and CISA KEV has not grown at the same rate.

Chat templates are code: Jinja2 SSTI in LLM inference servers

CERT/CC's VU#915947 (April 20, 2026) documents CVE-2026-5760, a CVSS 9.8 RCE in SGLang: a malicious GGUF model file carries a Jinja2 chat template that runs Python on the server. It is the same class as Llama Drama and a vLLM flaw before it.

DATA POISONING MEDIUM NEW

Oracle poisoning: corrupting the knowledge graph an agent reasons over

A paper published on arXiv on May 10, 2026 defines Oracle Poisoning: corrupt the knowledge graph an agent queries at runtime and it reaches wrong conclusions through correct reasoning. Across nine models, trust in poisoned data hit 100% under directed agentic queries.

INFRASTRUCTURE MEDIUM NEW

The serving layer is the attack surface: concurrency bugs in vLLM and SGLang

A May 2026 fuzzer, GRIEF, treats concurrent request traces as inputs and finds 15 serving-layer bugs (2 CVEs) in vLLM and SGLang: cross-request output contamination, noisy-neighbor DoS, and delayed crashes — no malformed input required.

CVE-2026-26268: Cursor's agent turns a git checkout into code execution

A malicious repo hides a bare Git repository with an automatic hook. When Cursor's AI agent runs git checkout to 'explain the codebase', the hook fires — arbitrary code execution on the developer's machine, no approval prompt. Patched in Cursor 2.5.

MalTool: when an AI writes the malicious tool your agent installs

Researchers used a coding LLM to synthesize 6,487 working malicious agent tools. VirusTotal missed most of them. The lesson: signature scanning is the wrong control for agent tool supply chains.

User-mediated attacks: when the user is the injection channel

A January 2026 study of 12 commercial agents shows attackers don't need to touch the agent. They trick a benign user into forwarding poisoned content — which the instruction hierarchy then promotes to trusted user intent. Default bypass rates topped 92%.

CVE-2026-26030: prompt injection becomes RCE in Microsoft Semantic Kernel

Microsoft's AI Red Team showed two Semantic Kernel flaws that turn a single injected prompt into host code execution. The lesson: any tool parameter the model can influence is attacker-controlled input. Patched May 7, 2026.

SearchGEO: making LLM search agents endorse attacker-published pages

A June 15, 2026 arXiv paper measures how attacker-controlled web content gets turned into an agent's endorsed recommendation — attack success ranges from 0% to 31.4% depending on the backend model.

LiteLLM CVE-2026-49468: a Host-header auth bypass in the gateway's own routing

Disclosed June 17, 2026, CVE-2026-49468 lets a crafted Host header desync LiteLLM's auth route from the route FastAPI runs — an app-layer repeat of BadHost, fixed in LiteLLM 1.84.0.

SkillVetBench: an LLM-as-Judge that catches what skill scanners miss

A June 14, 2026 arXiv paper shows code-layer skill scanners miss 89–100% of instruction-layer threats, while an LLM-as-Judge flags all 78 malicious test skills with zero false positives.

Toward Secure LLM Agents: a 247-paper SoK that reframes agent security as a systems problem

A June 9, 2026 arXiv survey of 247 papers maps LLM-agent security onto the agentic loop and finds defenses that work in isolation but barely compose — and benchmarks that miss long-horizon, stateful risk.

Zombie agents: when a self-evolving LLM agent stays compromised across sessions

A one-time indirect injection observed during a benign session can be written to an agent's long-term memory and later replayed as instruction — turning a transient prompt into persistent control. Attack paper dated February 2026, defense (CAMS) May 2026.

Behavioral geometry: predicting jailbreak susceptibility across a model population

A May 26, 2026 arXiv paper maps 79 models into a 'behavioral geometry' to predict which are jailbreak-prone — with 98% fewer probes — and to transfer defenses between them.

SafeMCP: look-ahead tool gating against power-seeking in MCP agents

A June 1, 2026 arXiv paper (ACL 2026) proposes SafeMCP, a server-side plugin that uses world-model look-ahead to filter hazardous tool acquisition before an MCP agent over-expands its powers.

Execution provenance for LLM agents: tracing evidence to rebuild trust

A June 2026 arXiv survey (2606.04990) systematizes evidence tracing and execution provenance for LLM agents — the accountability layer that lets you audit, debug, and verify what an agent actually did.

Ghost tool calls: speculative agent execution leaks user intent

A June 2026 arXiv paper (2606.02483) shows that agents which speculatively pre-issue tool calls to hide latency leak inferred user intent to external services — and that the leak is a timing problem no allow-list can undo.

The lethal trifecta is now the default — defend agents at runtime

The lethal trifecta once flagged risky agents. By mid-2026 it describes every useful one, so architecture-level avoidance no longer works. Defense shifts to five runtime behavioral signals.

AI Agent Traps: DeepMind's six-category map of how the web hijacks agents

Google DeepMind's 'AI Agent Traps' paper (SSRN, late March 2026) gives the first systematic taxonomy of adversarial web content that targets an agent's perception, reasoning, memory, action, multi-agent dynamics, and human overseer.

Adaptive jailbreaks keep breaking LLM defenses: the evaluation gap

A June 2026 framework, UniAttack, composes reusable attack features into one-shot jailbreaks that transfer across models and defenses — a reminder that any defense tested only against static attacks gives false assurance.

DoubtProbe: catching jailbreaks that reorganize intent

A June 2026 paper proposes an inference-time defense that treats jailbreak detection as a consistency check: rebuild the request under structural constraints, then flag the prompts whose meaning won't survive the round-trip.

2026-06-18//5 min

ShadowMerge: poisoning graph-based agent memory by colliding relations

A May 2026 paper poisons graph-based agent memory with relations that share a real anchor and channel but carry a conflicting value — reaching 93.8% attack success on Mem0 while input-side filters miss it.

2026-06-18//5 min

Secret Stealing: backdoored model code exfiltrates fine-tuning data

A 30 April 2026 paper shows that tampered model code — not poisoned weights — can steal API keys and PII from local fine-tuning data, reaching >98% recovery while bypassing DP-SGD and audits.

Black-Hole Attack: poisoning a vector database through embedding geometry

An April 7, 2026 paper shows a few vectors placed near the embedding centroid get pulled into up to 99.85% of top-10 results — a query-agnostic, model-agnostic poisoning of vector databases.

Why LLM agent defenses don't compose: lessons from 247 papers

A June 2026 systematization of 247 papers finds agent defenses are useful building blocks but weakly compositional, and benchmarks still miss long-horizon, stateful risk.

Membership inference via LLM tokenizers: a new privacy attack vector

A USENIX Security 2026 paper shows a model's tokenizer alone can leak which datasets were used in pre-training — a cheaper, model-free membership inference attack.

Browser agents leak their model identity through how they click

A May 14, 2026 paper shows the on-page actions of an LLM browser agent fingerprint the underlying model with up to 96% accuracy across 14 frontier models — no spoofable headers needed.

LiteLLM CVE-2026-47101→40217: low-privilege user to admin and RCE

Obsidian Security disclosed a three-bug LiteLLM chain (June 2026) that walks a default low-privilege user up to proxy_admin and remote code execution — a CVSS 9.9 takeover of the AI gateway.

MULTIMODAL MEDIUM NEW

Sirens' Whisper: inaudible near-ultrasonic jailbreaks of voice LLMs

A March 14, 2026 paper from Huazhong, Tsinghua and Microsoft hides jailbreak prompts in the 17–22 kHz band. Microphone nonlinearity demodulates them back into commands — silent to humans, up to 0.94 non-refusal on commercial voice LLMs.

Where agent attacks actually enter: a 247-paper threat-surface map

A June 2026 survey of 247 papers measures where LLM-agent attacks land. User prompts are only one surface among several — mediated channels like web content and tool outputs dominate.

JAILBREAK MEDIUM

IICL: pattern completion beats safety alignment with 10 examples

An April 2026 arXiv paper turns a model's own in-context learning against it: about ten abstract-operator examples make GPT-5.4 complete a harmful pattern its content filters never flag.

Detecting attacks in agent tool-call traffic: content beats graph

A May 2026 arXiv study of MCP tool-call monitoring finds content embeddings drive detection (AUROC > 0.89), graph structure adds little, and naive random splits inflate scores by up to 26 points.

The cold-start safety gap: agents are least safe at the very first turn

A June 2026 paper finds tool-calling agents are most vulnerable at the start of a session and grow 9–52% safer after a few routine tasks. The fix is a deployment warm-up, not a new guardrail.

RUBAS: rubric-based RL gives agent safety a fine-grained reward signal

A June 2026 paper replaces coarse refuse/comply rewards with four scored rubrics — tool-use, argument, response and helpfulness — to train tool-calling agents that stay safe without losing utility.

2026-06-17//5 min

Open-weight fine-tuning safeguards fall to gradient-free attacks

A May 2026 CMU study shows tamper-resistant safeguards like TAR and SEAM — built to survive malicious fine-tuning — are bypassed by two cheap gradient-free attacks: abliteration and prefilling.

MIRAGE: mobile GUI agents fooled by injected user-generated content

A May 2026 study shows VLM-driven mobile GUI agents can't tell trusted interface from user-generated content. Realistic text injected into comments and bios hijacks all five tested agents (23–30% success).

INDIRECT INJECTION CRITICAL NEW

LogJack: cloud logs as a prompt-injection channel against debugging agents

An April 2026 benchmark shows LLM debugging agents that read cloud logs and run fixes obey instructions hidden in log lines — verbatim command execution up to 86.2%, RCE on 6 of 8 models, and provider guardrails that miss almost everything.

The jailbreak tax disappears on frontier models — and that breaks a safety assumption

An April 2026 study shows the capability loss a jailbreak used to cause shrinks as models get stronger: Haiku 4.5 drops 33.1% when jailbroken, Opus 4.6 only 7.7%. Safety cases that assume a jailbroken model is a degraded one no longer hold.

Reasoning-extension DoS: when the AI guardrail becomes the attack surface

A June 2026 paper shows a single poisoned document can trap reasoning-based AI guardrails in extended thinking loops, slowing shared agent workflows by up to 148x. The target is availability, not integrity.

AI coding agents: attackers go for the credential, not the model

Six 2026 exploits against Codex, Claude Code, Copilot and Vertex AI all bypassed model-level defenses and reached the same target — the agent's runtime credentials. The root cause is an identity governance gap, not a prompt problem.

LiteLLM backdoored: when a poisoned CI scanner takes over the LLM gateway

In March 2026, attackers stole LiteLLM's PyPI publishing token by compromising Trivy inside its CI pipeline, then shipped two backdoored releases. The chain shows why the LLM gateway is a high-value supply-chain target.

2026-06-17//7 min

Reprompt: one-click Copilot data exfiltration via prefilled-URL prompts

A patched Copilot Personal flaw chained a prefilled-URL prompt, a guardrail that only checked the first request, and server-driven follow-ups into stealthy one-click data exfiltration. The bypass lessons generalise.

LangGraph checkpointers: from SQL injection to RCE on self-hosted agents

Check Point Research chained a SQL injection in LangGraph's checkpointer with an unsafe msgpack deserialization to reach remote code execution. Disclosed June 11, 2026; all three CVEs are patched.

2026-06-17//7 min

Termination poisoning: trapping LLM agents in unbounded loops

A May 2026 arXiv paper shows that injected prompts can distort an agent's own 'am I done?' judgment, forcing unbounded computation. The LoopTrap framework reports up to 25x step amplification.

Side channels on LLM inference: your prompts leak despite TLS

Speculative decoding and streaming responses create traffic patterns that leak prompt topics, languages, even PII — through encrypted connections. A look at three papers and the defenses.

M3Att: query-agnostic knowledge poisoning of medical multimodal RAG

A May 2026 paper poisons medical image-text RAG without knowing user queries in advance. Imperceptible image perturbations hijack retrieval; ambiguity-guided text evades the model's self-correction — and pre-filter defenses barely dent it.

SkillGuard: a permission framework that governs what an agent skill can do at runtime

A June 2026 paper closes the gap between what a skill injects into an agent's context and what it makes the agent do, using manifests, deny-by-default access control and runtime monitoring.

EU AI Act: how the draft guidelines classify agentic systems as high-risk

The European Commission's 19 May 2026 draft guidelines on Article 6 say agentic AI systems must be assessed as a whole — a single narrow component can pull the entire configuration into the high-risk regime.

Quality-Diversity red teaming: why one jailbreak score hides a whole map of weaknesses

Two June 2026 papers apply quality-diversity evolutionary search to LLM red teaming, surfacing many distinct vulnerability classes per model instead of a single best attack — and showing safety can regress between model generations.

Dummy backdoors: removing unknown LLM backdoors via shared internal mechanisms

A June 2026 paper removes hidden backdoors you can't see by planting one you can: different backdoors share internal activation patterns, so deleting a controllable 'dummy' weakens the unknown one too.

Semantic Compliance Hijacking: payload-less agent skills that scanners can't see

A May 14, 2026 arXiv paper shows a skill file with no code and no explicit harmful intent can steer a coding agent into writing its own malware at runtime — with a 0.00% detection rate against current scanners.

FragFuse: fragmented queries that bypass LLM agent access control

A June 14, 2026 arXiv paper shows a banned request can be split into benign fragments, parked in an agent's long-term memory, then fused at retrieval time — bypassing access controls 86.3% of the time.

NIST proof: no finite set of guardrails blocks every jailbreak

A NIST scientist used Gödel's incompleteness logic to prove that any finite set of AI guardrails can be evaded by some prompt — the case for a continuous monitor-and-update security model.

Langflow CVE-2026-5027: unauthenticated file write to RCE under active attack

A path traversal in Langflow's /api/v2/files endpoint lets an unauthenticated request write files anywhere on disk. VulnCheck confirmed in-the-wild exploitation on June 9, 2026; ~7,000 instances are exposed.

Agent security lives in the transitions, not the components

A June 2026 synthesis of 247 papers reframes LLM-agent security around state transitions: harm happens when untrusted text silently becomes a plan, a decision, an action, or durable memory.

AI CEOs ask Congress to make DNA synthesis screening mandatory

On June 5, 2026, the heads of OpenAI, Anthropic, Google DeepMind and Microsoft AI co-signed a letter urging Congress to require nucleic-acid synthesis screening — framing it as a defensive control against AI-eroded bioweapon barriers.

Para-jailbreaking: when 'safe completions' leak harm in the alternatives

An April 27, 2026 arXiv paper names a new failure mode of output-centric safety: a model can correctly refuse the direct question yet leak harmful content inside the 'safe alternative' it offers instead.

SCONE-bench: pricing autonomous AI exploitation in dollars stolen

Anthropic's December 1, 2025 study measures AI agent exploitation in money, not success rates: on smart contracts, frontier models produced $4.6M in simulated theft and two real zero-days at $1.22 per scan.

INDIRECT INJECTION CRITICAL NEW

Agentjacking: fake Sentry errors hijack AI coding agents via MCP

Tenet Security's June 2026 research shows an attacker can plant a fake Sentry error that AI coding agents read over MCP and execute, exfiltrating credentials with an 85% success rate across 2,388 exposed orgs.

HAMLOCK: a backdoor split between the model and the chip

A USENIX Security 2026 paper, covered June 15, 2026, splits a neural-network backdoor across software and silicon — the model alone never misclassifies, so software-only scanners like Neural Cleanse and MNTD find nothing.

Provenance defenses for agent graph memory are blind by construction

An arXiv paper dated June 10, 2026 shows provenance checks on LLM graph memory can be bypassed without forging a single source: untrusted structure reroutes which authenticated facts get selected, and information-flow control never sees it.

Agent privacy is a trajectory problem: OCELOT budgets inference leakage at runtime

An arXiv paper dated June 10, 2026 reframes LLM-agent privacy as posterior-risk control: not filtering each output, but budgeting how much an adversary's belief about a secret may improve across a whole trajectory.

Reasoning trace exposure: hiding chain-of-thought doesn't protect it

A May 2026 paper shows that prompting alone can pull a reasoning model's hidden chain-of-thought back into user-visible output — and the recovered traces are good enough to distill a smaller model.

Refusal-escape directions: why alignment can't fully close the jailbreak gap

A May 2026 paper proves aligned LLMs keep 'refusal-escape directions' baked into their operator structure — explaining why jailbreaks persist and why removing them costs utility.

Verified agent skills: capability governance for the SKILL.md supply chain

NVIDIA's May 19, 2026 verified agent skills add risk scanning, cryptographic signing and machine-readable skill cards to the SKILL.md supply chain — a defensive answer to poisoned skills.

SearchLeak (CVE-2026-42824): one click turns M365 Copilot into a data-theft proxy

Varonis disclosed the mechanics of CVE-2026-42824 on June 15, 2026: a crafted microsoft.com link chains prompt injection, an HTML render race and a Bing SSRF to exfiltrate mail and MFA codes. Patched server-side.

Parallax: putting agent safety in the architecture, not the prompt

A position paper published April 14, 2026 argues prompt-level guardrails fail the moment an agent's reasoning is compromised, and proposes structurally separating the part that thinks from the part that acts.

Cross-App Context Poisoning: a rogue ChatGPT app can steer the others

A June 2026 arXiv study shows a malicious ChatGPT app can write into the chat context shared by every connected app through first-party APIs, turning the model into a confused deputy against benign apps.

Disclosure at machine speed: lessons from the first AI vulnerability ledger

Anthropic's coordinated-disclosure ledger, analysed by VulnCheck on June 9, 2026, shows AI surfacing 23,019 candidate bugs while just 1,596 reached maintainers — a preview of coordinated disclosure under machine-speed discovery.

Architecting secure agents: a plan-and-policy defense against prompt injection

An NVIDIA position paper (March 31, 2026) argues that indirect prompt injection cannot be fixed at the model alone — and proposes a plan-and-policy system architecture that constrains what an agent may observe and decide.

GraphSteal: reconstructing a private knowledge graph from Graph RAG

A paper posted May 27, 2026 shows that black-box queries can turn a Graph RAG system into a structural oracle, rebuilding over 90% of its hidden knowledge graph — entities, relations and all.

Cross-domain multi-agent LLM systems: seven security challenges

A Perspective published June 13, 2026 in npj Artificial Intelligence maps seven security challenges that appear when LLM agents from different organizations collaborate without any shared trust model.

MEntA: membership inference on RAG corpora in five entailment queries

A May 2026 USENIX Security paper shows an attacker can tell whether a document sits in a RAG retrieval corpus with about five plain-language questions — no shadow models, no templated prompts, and it survives current defenses.

When #1 trending is malware: the Open-OSS/privacy-filter Hugging Face typosquat

On May 7, 2026 HiddenLayer found Open-OSS/privacy-filter, a typosquat of OpenAI's model that reached #1 trending on Hugging Face with ~244K downloads in 18 hours before shipping a Rust infostealer.

When a government pulls a model: the Fable 5 / Mythos 5 suspension

On June 12, 2026, a US export-control directive forced Anthropic to disable Claude Fable 5 and Mythos 5 worldwide. The reported trigger was a 'jailbreak' that amounts to asking a model to read code and fix flaws — a capability defenders use daily.

XL-SafetyBench: testing LLM safety across 10 countries, not just English

A May 7, 2026 arXiv paper from AIM Intelligence and Microsoft's AI Red Team shows English-centric safety tests miss country-specific harms — and that many models' 'safety' is refusal by accident, not genuine alignment.

MalSkillBench: we can't measure malicious-skill detectors because the test data is biased

A June 2026 paper builds the first runtime-verified benchmark of malicious agent skills — 3,944 samples across 108 attack cells — and shows a single detector's recall can swing 66 points depending on which dataset you test it on.

Why prompt-injection detectors keep failing: the evasion problem in 2026

From keyword classifiers to activation-based drift probes, prompt-injection detectors share one weakness: an adaptive attacker. Two studies report up to ~100% evasion. Treat detection as one layer, never the boundary.

LLM privacy isn't one risk: what an ablation study tells you to fix first

A May 2026 study measures membership inference, attribute inference, data extraction and backdoors under one threat model. The finding: leakage is driven by your design choices — scale, data duplication, RAG config — not by the attack alone.

TOCTOU in AI agents: atomicity violations between observation and action

An old operating-systems bug class resurfaces in agents: the world changes between when an agent looks and when it acts. New 2026 research formalizes it for GUI, browser, and multi-agent systems.

Injection depth in ReAct agents: position beats wording

A June 2026 study of tool-calling ReAct agents finds injection depth—not rhetoric—drives indirect prompt injection: success falls from 60% at the first tool call to 0% by the fourth.

Confidential Computing for Agentic AI: what enclaves can't protect

A May 2026 survey maps confidential computing onto the agentic stack — hardware enclaves can shield agent memory and KV caches from a malicious cloud operator, but they cannot stop prompt injection.

Splunk MCP Server logs auth tokens in clear text (CVE-2026-20205)

Splunk's MCP Server app wrote users' session and authorization tokens unmasked into the _internal index — a CWE-532 secrets-in-logs flaw that turns log access into token theft. Fixed in app v1.0.3.

DNS rebinding turns localhost MCP servers into a remote attack surface

A coordinated 2025–2026 disclosure wave hit every major MCP SDK over one root cause: HTTP servers on localhost that skip Host/Origin validation. The latest, CVE-2026-11624 in Google's MCP Toolbox (June 13, 2026), is rated Critical 9.4.

Why jailbreaks transfer between models — and how salting fights back

A study of 20 open-weight models finds jailbreak transfer comes from shared internal representations, not safety-training quirks. A defense called LLM salting rotates the refusal direction to break reuse.

A safe model is not a safe agent: lessons from the ClawSafety benchmark

An April 2026 benchmark runs 2,520 sandboxed trials on personal AI agents and finds attack success rates of 40–75%. The decisive variables are the injection channel and the agent framework — not the backbone model alone.

ktransformers: unauthenticated RCE via pickle over ZeroMQ (CVE-2026-26210)

A critical RCE in the ktransformers inference engine exposes a ZMQ socket on all interfaces and pickle-loads whatever it receives. It is the latest case of the 'ShadowMQ' pattern copied across AI serving stacks.

CVE-2026-46519: when an MCP server filters tools at display but not at execution

mcp-server-kubernetes enforced its read-only and allow-list controls only in tools/list, never in tools/call. Any client that knew a tool name could run it. A clean lesson in presentation-layer vs execution-layer authorization.

CRCP: RAG corpus poisoning that survives chunking and reranking

A June 9, 2026 arXiv paper shows many corpus-poisoning attacks quietly fail after reranking — and proposes CRCP, a chunk-aware variant built to survive realistic multi-stage RAG pipelines. The lesson is about how you evaluate, not just how you defend.

Cyber Defense Benchmark: frontier LLMs flunk open-ended threat hunting

An April 2026 benchmark drops five frontier models into raw Windows logs and asks them to hunt. The best finds 3.8% of malicious events — none clears the bar for unsupervised SOC work.

Malicious LLM API routers: the unguarded man-in-the-middle for agents

A UC Santa Barbara study (arXiv, April 9, 2026) measured 428 third-party LLM API routers and found dozens injecting code, stealing credentials and draining a crypto wallet — all from a trust boundary developers configure voluntarily.

Flowise CVE-2026-41264: LLM-written pandas code that escalates to RCE

A prompt injection in Flowise's CSV Agent makes the model emit Python that escapes a regex denylist and runs OS commands. Disclosed April 15, 2026 and patched in 3.1.0.

SafeHarbor: a hierarchical memory guardrail that targets agent over-refusal

Accepted at ICML 2026, SafeHarbor is a training-free guardrail that injects context-aware safety rules from a self-evolving risk tree — keeping 63.6% benign utility on GPT-4o while refusing over 93% of attacks.

SEC-bench Pro: how well can AI agents really hunt bugs in V8 and SpiderMonkey?

A May 26, 2026 benchmark measures coding agents on long-horizon vulnerability discovery in real browser engines. Frontier models stay below 40% — and the gap matters for both attackers and defenders.

Prompt injection is unsolved — so contain it at machine speed

At Infosecurity Europe 2026, OWASP's Ariel Fogel called prompt injection an unresolved architectural problem and argued defenders must shift from prevention to runtime containment that runs as fast as the agent.

SecureClaw: a dual-boundary defense for tool-using LLM agents

A June 2026 paper proposes guarding two distinct boundaries at once — authorizing external actions at the effect sink and confining plaintext at the read boundary — reporting 0% attack success on one agent benchmark.

2026-06-14//6 min

Multi-clip video jailbreaks: why video inputs break multimodal LLM safety

A June 2026 ACL paper shows the video channel is a weaker safety boundary than images: attack success climbs as a video is split into more diverse short clips.

2026-06-14//6 min

SIGIL: proving your text was in an LLM's training set

A June 2026 arXiv paper proposes embedding imperceptible canaries into text and code so content owners can prove, with controlled false-positive rates, that a model was trained on their data.

ConVerse: when two agents talk, the stronger one leaks more

A benchmark for agent-to-agent conversations finds privacy attacks succeed up to 88% of the time and security breaches up to 60% — and that more capable models leak more, not less.

Brain-prompt injection: when neural signals become an agent's authorization channel

A June 8, 2026 arXiv paper names a new attack surface: BCI-to-agent pipelines that turn decoded EEG into a tool-use authorization channel. Three injection vectors flip the routed action while EEG- and text-side monitors stay blind.

PI-Hunter: auditing agents to expose and localize hidden prompt injections

A June 2026 paper from Google researchers reframes prompt-injection red-teaming as auditing — PI-Hunter evolves source-aware test cases to surface where latent injections enter and propagate through an agent, not just whether an attack lands.

Claude Code GitHub Action: how the Read tool leaked CI/CD secrets

Microsoft Threat Intelligence found that Claude Code Action's Read tool bypassed the Bash env scrub to reach /proc/self/environ, leaking the runner's ANTHROPIC_API_KEY. Patched in v2.1.128.

Exposed MCP Servers Become Cloud Takeover Pivots

Command injection in cloud MCP servers (CVE-2026-5058/5059) lets attackers reach the instance metadata service, steal the IAM role, and pivot into the whole cloud account.

OWASP State of Agentic AI Security 2026: prompt injection ties most agent failures together

OWASP's State of Agentic AI Security and Governance v2.01 (June 1, 2026) moves from hypothetical threats to documented CVEs and breaches. Prompt injection now maps to six of the ten agentic risk categories.

Credential leakage in LLM agent skills: a 17,000-skill empirical study

An April 3, 2026 arXiv study analyzed 17,022 agent skills and found 520 leaking credentials — 73.5% of the leaks flow through debug logging that pipes secrets straight into the model's context.

Beyond tool poisoning: what a malicious remote MCP server can actually do

A May 21, 2026 study maps the full threat surface of malicious remote MCP servers across ChatGPT, Claude Desktop and Gemini CLI — finding host filtering swings from 95% to 50% on the same request, and successful attacks are almost never disclosed.

Tool stream injection: why static agent defenses break, and what verify-before-commit fixes

A January 2026 paper, VIGIL, reframes indirect injection around the tool stream — forged tool descriptions and fake error messages — and shows that the better-aligned an agent is, the more it obeys them.

Inside GitHub Agentic Workflows: a security architecture for CI/CD agents

GitHub Agentic Workflows reached public preview on June 11, 2026 with a security-first design: zero-secret agents in a chroot jail, a workflow firewall, staged-and-vetted writes, and a threat-detection job. The defensive answer to prompt injection in CI/CD.

Prompt inversion: split LLM inference leaks prompts, a principled defense lands

Prompt inversion attacks recover up to 88.4% of input tokens from intermediate activations in collaborative LLM inference. A paper submitted June 10, 2026 proposes the first information-theoretic defense.

Multimodal input as attack surface: vLLM's video-decoder RCE (CVE-2026-22778)

CVE-2026-22778 turns a malicious video URL into remote code execution on vLLM servers, chaining a PIL info leak with an FFmpeg JPEG2000 heap overflow. Patched in 0.14.1.

TRUSTDESC: deriving tool descriptions from code to defuse tool poisoning

An April 2026 paper attacks tool poisoning at its root: generate a tool's description from its implementation instead of trusting the author-supplied text, neutralising implicit poisoning that detectors miss.

Newer isn't always safer: non-monotonic safety alignment across model generations

A May 2026 paper red-teaming four Gemma generations found the mid-generation model was far easier to jailbreak than both its predecessor and successor — safety doesn't improve in a straight line.

RTK (CVE-2026-45792): untrusted filter configs hide backdoors from AI review

Pillar Security disclosed on May 20, 2026 a flaw in RTK, a token-optimisation filter for Claude Code: a repo-supplied .rtk/filters.toml could silently strip a backdoor from command output before the model ever saw it. The target is the agent's perception, not its execution.

Causality laundering: when a blocked tool call still leaks data

An April 2026 paper shows that denying an agent's tool call is not the end of the attack: the denial itself is an information channel. Flat taint tracking misses it.

GOVERNANCE LOW NEW

DeepMind and partners open a $10M multi-agent AI safety research fund

On June 11, 2026, Google DeepMind, Schmidt Sciences, the Cooperative AI Foundation and ARIA opened a $10M call to build a research field around the safety of millions of interacting AI agents.

The Recuse Signal: a robots.txt for agents that hold real credentials

A June 2026 paper proposes an in-band 'deny' signal — emitted over an SSH banner or a PostgreSQL NOTICE — that politely asks an autonomous agent to withdraw. In a pilot it induced 100% recusal, but an authorization framing flipped the strongest model right back.

CodeSpear: when grammar-constrained decoding becomes a jailbreak surface

A June 10, 2026 arXiv paper shows that the reliability feature forcing LLM code output to be syntactically valid can itself be turned into a jailbreak. Applying a benign code grammar can bypass refusals; the authors' CodeShield defense answers with honeypot code.

The Defense Trilemma: why prompt-injection wrappers can't be complete

A Lean 4-verified April 2026 proof shows no continuous, utility-preserving input wrapper can block every prompt injection. Continuity, utility, and completeness cannot all hold at once.

Mnemonic sovereignty: securing the whole memory lifecycle of agents

An April 2026 survey reframes LLM-agent memory security as a six-phase lifecycle and shows the field ignores forgetting, confidentiality and non-adversarial drift.

Injection keeps leaking Copilot: two new June 2026 disclosure CVEs

June 9, 2026 Patch Tuesday shipped CVE-2026-42824 and CVE-2026-47644 — two injection-class information-disclosure flaws in Microsoft's Copilot surface, continuing the exfiltration lineage that started with EchoLeak.

ChromaToast: a pre-auth RCE in the ChromaDB vector database

HiddenLayer's May 18, 2026 disclosure (CVE-2026-45829, CVSS 10.0) shows ChromaDB's Python server loads an attacker's HuggingFace model and runs its code before it ever checks authentication.

DACSI: when retrieved documents fake the system's control signals

A June 8, 2026 paper names a quiet RAG failure mode: untrusted document text impersonating metadata, provenance and policy signals. No 'ignore previous instructions' required — the lesson is that document-authored labels are data, not policy.

AgentDyn: why injection defenses that ace static benchmarks fail in the wild

A February 2026 ICML benchmark, AgentDyn, runs ten leading prompt-injection defenses on dynamic, open-ended agent tasks. Almost all are either insecure or over-defend into uselessness.

StakeBench: who actually pays when a web agent gets injected?

A stakeholder-centric benchmark from NTU, IBM Research and UIUC shows web agents fail every injection objective tested — and that the harm often lands on third parties, not the user.

Hades worm: poisoned AI coding-tool config that runs on repo open

The Hades supply-chain worm commits config files for Claude Code, Gemini, Cursor, and VS Code that execute on session start or folder open — turning a cloned repo into a credential stealer with no install step.

2026-06-11//7 min

The Injection Paradox: when a prompt injection backfires and erases a brand in RAG

A June 8, 2026 arXiv preprint shows prompt injections in retrieved documents can backfire in safety-trained Claude models, dropping a brand from a 54% to 0% recommendation rate — opening a reverse-attack against competitors.

Context-Fractured Decomposition: jailbreaks through artifact provenance gaps

A June 8, 2026 arXiv paper formalizes the 'provenance gap' in tool-using agents: harmful behavior assembled from individually innocuous tool actions across time, lifting jailbreak success up to 28.3 points.

OWASP's agentic maturity model: don't run in the red cells

OWASP's June 2026 State of Agentic AI report adds an Enterprise Adoption Maturity Model — a two-axis grid where agent autonomy outruns governance, leaving 'red cells' no one can see into.

SABER: coding agents fail operational safety even when they refuse bad prompts

A May 31, 2026 benchmark scores LLM coding agents on the final state of a real workspace, not on prompt refusal. Even the best model leaves a harmful violation in over half of runs.

Cursor allowlist bypass: shell built-ins poison the environment for RCE

CVE-2026-22708 lets a prompt injection use trusted shell built-ins like export and typeset to poison environment variables in Cursor, turning an approved git or python command into remote code execution. Patched in 2.3.

Oversight has a capacity: when more agent approvals make you less safe

A June 8, 2026 arXiv paper models the human reviewer behind an agent's approval gate as a fatiguing, finite resource — and shows that escalating more actions can lower realized safety and open a flooding attack.

2026-06-11//7 min

HPAA: typography humans read but moderation LLMs miss

A June 8, 2026 paper introduces Human-Perceptible Adversarial Attacks — harmful text that stays obvious to a human reader but slips past LLM content moderation through typographic manipulation.

2026-06-11//5 min

Web chatbot plugins: how insecure widgets amplify prompt injection

An IEEE S&P 2026 study of 17 chatbot plugins on 10,000+ sites found forgeable conversation histories (3-8x stronger injections) and web-scraping tools that mix trusted and untrusted content.

AuditBench: LLMs investigating real attacks are false-positive machines

A June 2026 benchmark tests five frontier LLMs on real audit-log investigations. Verdict: overly suspicious models, many false positives — and smaller models often match the big ones.

CASA: task-based access control that checks tool calls against the user's real intent

A May 4, 2026 arXiv paper proposes Continuous Agent Semantic Authorization — a zero-trust layer that extracts a user's task from a multi-turn chat and denies tool calls that don't match it.

LiteLLM CVE-2026-42271: MCP test endpoints chain to unauthenticated RCE

Disclosed in April as an authenticated command injection, LiteLLM's MCP preview endpoints became unauthenticated RCE once chained with Starlette's BadHost bypass — CISA added it to KEV on June 8, 2026.

2026-06-10//6 min

Memory Control Flow Attacks: when stored memory steers an agent's tools

A March 2026 paper shows poisoned agent memory doesn't just corrupt content — it hijacks the control flow of tool selection, forcing unintended tools and skipped steps in over 90% of trials, across tasks and long after injection.

2026-06-10//7 min

Transformers config injection: silent RCE that walks past trust_remote_code

CVE-2026-4372, disclosed June 4, 2026, lets a single config.json field run attacker code on a routine from_pretrained() call — bypassing trust_remote_code=False in Hugging Face Transformers.

2026-06-10//7 min

ADR: detection and response for MCP agents, proven at Uber scale

A May 2026 paper from Uber describes a production EDR-style system for MCP agents: full causal telemetry, two-tier detection, and offline red-teaming, running on 7,200+ hosts for ten months.

Forgotten but recoverable: why LLM machine unlearning keeps leaking back

Multiple 2025-2026 papers show 'unlearned' knowledge in LLMs is routinely recoverable — via quantization, adversarial prompting, and now reasoning traces. Treating unlearning as erasure is a mistake.

2026-06-08//7 min

ePCA: replacing semantic agent guardrails with formal verification

A May 2026 paper proposes ePCA, a guardrail that compiles each agent action into first-order logic and runs an SMT check before execution, blocking unsafe steps as logical deadlocks.

Remote MCP servers: 40% unauthenticated, OAuth broken on the rest

A May 2026 arXiv study scanned 7,973 live remote MCP servers: 40.55% expose tools with no authentication, and all 119 OAuth-enabled servers tested carried at least one flaw — 9 CVEs assigned.

Why benchmarking security agents is hard

A position paper published May 21, 2026 argues that the leaderboards used to score security agents are quietly broken: the adversarial reasoning you want to measure can also break the benchmark itself. Three failure modes, and how to evaluate honestly.

AgentTrust: vetting agent tool calls before they execute

A preprint from May 6, 2026 introduces AgentTrust, a runtime layer that vets each agent tool call before it runs and returns allow/warn/block/review — catching obfuscated shell payloads static guards miss.

Catching model extraction by watching the whole traffic window, not single queries

A June 2026 paper shows a simple distribution test (MMD over query embeddings, calibrated on benign traffic only) detects LLM model-extraction campaigns hidden in mixed API traffic — 0.3% false positives, 100% on pure-attacker streams.

Agent Security Is a Systems Problem: Treat the Model as Untrusted

A May 2026 position paper from Google, UCSD and UW–Madison argues agent security must move out of the model and into the system: treat the LLM as an untrusted component and enforce invariants around it.

2026-06-08//8 min

Sequential data poisoning: splitting a backdoor across post-training stages

A June 3, 2026 paper shows that poison spread across SFT and preference data — negligible at each stage alone — combines into a working backdoor. Per-stage audits create a 'single-attacker illusion'.

Five attacks on x402: when AI agents pay, the cross-layer seams leak

A May 12, 2026 paper formally breaks x402, the HTTP 402 agentic payment protocol. Five attacks across settlement, replay, web handling and discovery — one replayed payment yielded 248 grants on a live endpoint.

How agentic AI compresses the cyber attack lifecycle

A May 2026 arXiv paper models how agentic AI lowers the cost of every attack stage — from reconnaissance to post-compromise — compressing the kill chain and shifting defensive priorities for enterprises.

Why independent AI-agent developers keep missing security risks

A June 2026 arXiv study of independent AI-agent developers finds a user-centric blind spot: builders focus on harmful-content safety while overlooking prompt injection, data exfiltration, and cross-border privacy.

SlotGCG: adversarial token position, not just content, drives jailbreaks

A June 2026 paper shows GCG-style jailbreaks get ~14% stronger when adversarial tokens are placed at attention-correlated slots inside the prompt — and keep 42% more success under input filtering.

MS-Agent's shell tool: a regex denylist turns prompt injection into RCE

CVE-2026-2256 lets attacker-controlled content steer ModelScope's MS-Agent into running OS commands. The root cause is a familiar anti-pattern: guarding a shell tool with a regex denylist instead of an allowlist.

OWASP ASI02: when an agent turns its own tools against you

Tool Misuse & Exploitation is the #2 risk in OWASP's Top 10 for Agentic Applications 2026. The danger isn't an agent gaining new tools — it's misusing the ones it already holds, via over-privilege, poisoned descriptors, or unsafe chaining.

Hands-free firmware VR: an LLM agent reverse-engineers an OT intercom end-to-end

On June 2, 2026, Claroty Team82 ran Claude Opus 4.6 with a Ghidra MCP server against a Zenitel intercom firmware image and re-found a set of known CVEs in under ten minutes — a preview of commoditized firmware vulnerability research.

Beyond shallow safety: mid-sequence injection still flips aligned LLMs

A June 3, 2026 arXiv paper shows safety alignment can be redirected not just at the first tokens but at any generation step — and a model's hidden-state refusal directions don't predict its robustness.

Need to Know: contextual-integrity query rewriting for LLM delegation

A June 2, 2026 arXiv paper recasts privacy-preserving query rewriting as a contextual-integrity problem: forward a span to a cloud LLM only if the task needs it, not because a PII type matched.

Membrane: contrastive safety memory that adapts guardrails without retraining

A June 4, 2026 arXiv paper proposes Membrane, a self-evolving guardrail that pairs each blocked attack with a near-identical benign request, cutting over-refusal to 7-14% while topping F1 on six jailbreaks.

OpenAI Lockdown Mode: cutting the exfiltration leg of prompt injection

On June 6, 2026 OpenAI extended Lockdown Mode to personal and self-serve Business ChatGPT accounts: a deterministic setting that disables outbound paths attackers use to exfiltrate data via prompt injection.

Decision Hijacking: prompt-injecting the LLM that ranks your search results

A growing body of 2025-2026 research shows that when an LLM re-ranks search or RAG candidates, a few injected lines inside one document can force it to the top — collapsing ranking quality by 60+ NDCG points, with stronger models more vulnerable, not less.

THRD: a training-free temporal defense against multi-turn jailbreaks

A June 2026 paper argues multi-turn jailbreaks must be judged across the whole conversation, not turn by turn. THRD scores accumulated risk over time and cuts attack success to 0.2–4% without retraining.

MetaBackdoor: a length-based backdoor trigger that leaves no trace in the input

A May 2026 paper from Microsoft and Institute of Science Tokyo plants a backdoor whose trigger is the input's length, not its text. The prompt looks clean, content filters see nothing, and 90 poisoned examples are enough.

Langflow's public build endpoint: unauthenticated RCE weaponised in 20 hours

CVE-2026-33017 turns Langflow's public flow-build endpoint into unauthenticated remote code execution. Disclosed March 17, 2026, it was exploited in the wild within 20 hours — before any public PoC existed.

Two methodology traps that inflate prompt-injection detector scores

A June 1, 2026 arXiv preprint shows most prompt-injection and jailbreak detector benchmarks lean on per-dataset threshold tuning and undisclosed operating points — two habits that quietly inflate the accuracy you buy.

AgentVisor: an OS-hypervisor pattern that audits every agent tool call

An April 27, 2026 arXiv paper borrows the OS hypervisor idea to defend tool-using LLM agents: a trusted 'visor' audits every tool call and is architecturally blind to untrusted content.

Microsoft's agentic failure-mode taxonomy v2.0: zero-click human-in-the-loop bypass

Microsoft's AI Red Team v2.0 taxonomy (June 4, 2026) adds seven agentic failure modes and reports human-in-the-loop bypass as the most consistently exploited — including zero-click chains from a single external input.

Back-Reveal: data exfiltration through a backdoored agent's own tool calls

A finetuned agent carries a hidden trigger. On a benign cue it reads your session memory and ships it out disguised as an ordinary retrieval call — no prompt injection, no malicious tool. Paper dated April 7, 2026.

VIPER-MCP: 67 CVEs from taint-style flaws across 40,000 MCP servers

A May 20, 2026 arXiv paper audited 39,884 open-source MCP server repos, confirmed 106 zero-days end-to-end and got 67 CVE IDs assigned. The story is the pattern: untrusted agent input reaching shell, network and file-system sinks.

Optimus: scoring jailbreaks beyond pass/fail reveals a stealth-optimal regime

A May 9, 2026 arXiv paper argues binary attack-success-rate hides the jailbreaks defenders should fear most. Its Optimus metric scores prompts on similarity and harmfulness, exposing a 'stealth-optimal' band where ASR collapses to zero.

2026-06-05//7 min

No two labs measure prompt injection the same way

A June 1, 2026 comparison of the prompt-injection disclosures from Anthropic, OpenAI, Google and Meta found that no two labs share a metric, a surface, or a definition of success — so vendor numbers cannot be compared.

AgentRedBench: indirect injection in SaaS agents is an authorization gap

AgentRedBench (June 2026) red-teams LLM agents reading from SaaS tools like Gmail and Jira. No-guard attack success ran 32–81% across eight frontier models, until a tool-response classifier cut it.

2026-06-05//7 min

Adaptive AI worms: when malware runs its own local LLM

A June 2026 University of Toronto paper demos a worm that runs open-weight LLMs on the machines it compromises, adapting its exploit per target and weaponising advisories published after the model's training cutoff.

2026-06-05//7 min

CVE-2026-45497: command injection turns Microsoft 365 Copilot into an RCE path

On June 4 2026 MSRC disclosed CVE-2026-45497, a command-injection flaw in Microsoft 365 Copilot rated as remote code execution with a scope change across the service boundary. Fixed server-side.

trust_remote_code=False isn't a boundary: vLLM's recurring model-load RCE

CVE-2026-27893 (disclosed March 27, 2026) is vLLM's third trust_remote_code bypass. Two model files hardcode trust_remote_code=True, silently overriding an operator's opt-out and enabling RCE from a malicious model repo.

When an MCP tool argument becomes an Android intent: mobile-mcp's injection sinks

CVE-2026-35394 lets a model-controlled URL fire arbitrary Android intents through mobile-mcp's mobile_open_url tool. Paired with a sibling path-traversal CVE, it shows a pattern: MCP tool arguments flowing unvalidated into platform sinks.

The agent that writes its own logs: why self-reported agent audit trails can't be trusted

If a compromised agent produces its own activity log, it can omit, alter, or fabricate what it did. Three June 2026 efforts — arXiv's Notarized Agents, an IETF agent-audit-trail draft, and SCITT — converge on the same fix: move the trust boundary off the agent.

GGUF model files are untrusted input: llama.cpp's recurring parser RCEs

CVE-2026-33298 (March 2026) and a May 15, 2026 oss-sec disclosure show llama.cpp's GGUF parser keeps hitting integer-overflow heap corruption: loading a crafted model file can mean RCE.

MPBench: a systematic taxonomy of memory poisoning in LLM agents

A June 3, 2026 arXiv study maps four memory write channels, nine structural weaknesses and six attack classes — and shows prompt-injection defenses don't cover memory poisoning.

When embedding-based defenses fail in LLM multi-agent systems

A May 1, 2026 arXiv paper shows that detectors which prune malicious agents by message embedding collapse when attackers craft near-benign text — and proposes token-confidence signals as a more robust replacement.

AGENTS.md injection: a poisoned dependency can silently rewrite your coding agent's orders

An April 20, 2026 NVIDIA AI Red Team report shows a malicious dependency can drop a crafted AGENTS.md at build time, override the developer's prompt, and instruct OpenAI Codex to hide the change from the pull request.

Social contagion: LLM agents leak private data in multi-agent settings

A May 2026 study simulating thousands of LLM agents finds privacy leakage is socially contagious: agents leak ~8x more after a peer does, and explicit privacy instructions reduce but don't eliminate it.

Self-propagating agent worms and the temporal re-entry defense

A May 2026 paper formalizes how persistent agent state lets a prompt-injection payload write itself back into the LLM context, propagate across agents zero-click, and proposes RTW-A — a defense proven under a No Persistent Worm Propagation theorem.

PISmith: adaptive RL red-teaming keeps breaking injection defenses

A March 2026 paper trains an attacker model with reinforcement learning to stress-test prompt-injection defenses in a black-box setting — and 8 state-of-the-art defenses still fall, including on AgentDojo and InjecAgent.

SGLang's ZMQ broker: unauthenticated RCE via pickle deserialization

Three CVEs disclosed March 12, 2026 turn SGLang's pickle.loads() calls into unauthenticated remote code execution. The fix landed in v0.5.10 — but the real lesson is that pickle on a network socket is RCE by design.

Tool poisoning across 7 MCP clients: a comparative security posture

A March 2026 empirical study tests four tool-poisoning attacks against Claude Desktop, Claude Code, Cursor, Cline, Continue, Gemini CLI and Langflow — and finds most protection comes from the model, not the client.

Description poisoning: the agent channel your benchmarks don't test

A May 2026 AWS Bedrock AgentCore demo and a June 2026 arXiv paper converge on the same blind spot: tool descriptions, read before every call, are an injection channel that infra controls and single-number benchmarks both miss.

Hybrid BM25 + vector retrieval cut gradient-guided RAG poisoning from 38% to 0%

A March 10, 2026 arXiv preprint shows that adding sparse BM25 alongside dense retrieval blocks an entire class of gradient-optimized RAG corpus poisoning — without touching the LLM.

AgentShield: catching compromised agents with honeytokens and decoy tools

A May 2026 paper turns deception engineering on tool-using LLM agents: fake tools, fake credentials, and parameter allowlists that a hijacked agent trips over. It reports 90.7–100% detection of successful attacks with zero false alarms.

OWASP Agent Memory Guard: a runtime layer against agent memory poisoning

Covered by Help Net Security on June 1, 2026, OWASP's Agent Memory Guard is the first reference implementation for ASI06 — a drop-in layer that screens every agent memory read and write against a YAML policy.

Catching credential exfiltration in LLM agents before the output token

Published June 2, 2026, an arXiv paper detects agent credential leaks before any output token is emitted — combining activation probes, calibrated honeytokens, and multi-turn leakage accounting.

AI threat actors mapped to MITRE ATT&CK: the ARiES score and what it breaks

Anthropic's June 3, 2026 report maps a year of AI-enabled cyberattacks to MITRE ATT&CK. The finding for defenders: sophistication, technique count and interface no longer predict an actor's risk — orchestration does.

AIRQ scores 100 production AI agents: 98% carry the lethal trifecta

Adversa AI's June 2026 AI Risk Quadrant rates 100 commercial agents on attack surface, blast radius and defenses. Only 11% are well-defended; tool execution alone explains 76% of blast radius.

US AI security executive order: a vulnerability clearinghouse and frontier review

Signed June 2, 2026, the US executive order on AI innovation and security creates a federal AI vulnerability clearinghouse and a voluntary 30-day pre-release review of 'covered frontier models'.

CVE-2026-30615: prompt injection rewrites Windsurf's MCP config into RCE

OX Security's April 15, 2026 advisory shows how attacker-controlled content can make the Windsurf IDE register a malicious MCP STDIO server and run commands — with no user click. The class spans coding agents, but Windsurf got the CVE.

Opus 4.8's system card puts a number on browser-agent prompt injection: 31.5%

Anthropic's May 28, 2026 Claude Opus 4.8 system card reports a 31.5% pre-safeguard hijack rate for its browser agent — the only concrete prompt-injection metric a frontier lab published this spring.

Agent Threat Rules: a "Sigma for AI agents" — and what its recall numbers admit

ATR ships open YAML detection rules for agent attacks, now running at Microsoft, Cisco and Gen Digital. Its own benchmarks show why regex detection is a layer, not a perimeter.

ChatInject: forging chat-template role tags to bypass the instruction hierarchy

An ICLR 2026 paper shows that wrapping an indirect-injection payload in a model's own chat-template tokens forges a higher-priority role, lifting attack success from 5% to 32% on AgentDojo and to 52% with multi-turn.

2026-06-03//7 min

ASPI: asking the user to clarify widens the injection surface

A May 17, 2026 arXiv benchmark shows that when an agent pauses to ask the user for clarification, prompt-injection success climbs from under 2% to over 34% on o3 and Gemini-3-Flash.

SnapGuard: catching prompt injection in what the agent sees, not what it parses

An April 2026 paper proposes a lightweight detector for screenshot-based web agents, where text-centric guards are blind. It reads the rendered pixels — gradient stability plus polarity-reversed text — at 1.81s per page.

CyBiasBench: offensive LLM agents keep picking the same attacks

A May 2026 benchmark logged 630 attack sessions and found that LLM agents in offensive cyber scenarios fixate on a narrow set of attack families — regardless of how you prompt them. Bias, not skill, shapes what they try.

Authorization propagation: the agent security gap prompt-injection fixes won't close

A May 6, 2026 paper by Krti Tallam argues multi-agent systems have a distinct authorization-propagation problem — transitive delegation, aggregation inference, temporal validity — that survives even a perfect prompt-injection defense.

2026-06-03//7 min

Goal reframing: the one prompt feature that makes LLM agents exploit planted bugs

An April 6, 2026 arXiv study ran ~10,000 agent trials across seven models. Most 'manipulation' tactics did nothing — only goal reframing, like 'you are solving a puzzle', reliably pushed agents to exploit a planted bug.

CAESAR: coordinated LLM agents beat the single-model reasoning ceiling

A May 9, 2026 arXiv paper shows that splitting an LLM attacker into five typed roles outperforms a single agent on 25 CTF tasks across four models — the gain comes from coordination structure, not raw capability.

ClawTrojan: stored prompt injection becomes a persistent agent backdoor

A May 29, 2026 arXiv paper shows injection hidden in a file can be stored by a local agent and run later — reaching 95.5% attack success where single-turn injection scores near zero.

DataShield: when benign fine-tuning quietly erodes a model's safety

A May 29, 2026 arXiv paper shows fine-tuning an aligned LLM on harmless data still degrades its safety, and proposes DataShield to flag the samples responsible before training.

Langroid SQLChatAgent: prompt-to-SQL injection escalates to RCE (CVE-2026-25879)

Disclosed June 1, 2026, CVE-2026-25879 (CVSS 9.8) lets a prompt-injected SQL agent run dialect-specific primitives like COPY FROM PROGRAM, turning a chat box into code execution on the database host.

Just ask the bot: Meta's AI support assistant and the Instagram takeovers

Over the May 30–31, 2026 weekend, attackers hijacked high-profile Instagram accounts by asking Meta's AI support bot to relink an account email. No prompt injection required — only excessive agency.

Brittle agents: indirect injection survives multi-step tool calls

An April 4, 2026 paper tests 6 defenses against 4 indirect-injection vectors across 9 LLM backbones in multi-step agents — advanced injections bypass nearly all of them, and some surface mitigations backfire.

Stop fixating on the prompt: hijacking an agent's reasoning and memory

An April 2026 paper, JailAgent, drives an agent to malicious tool calls without touching the user prompt — by perturbing its reasoning trace and memory retrieval instead. The prompt was never the whole attack surface.

Trojan Hippo: dormant agent-memory payloads that exfiltrate your data

A May 3, 2026 arXiv paper shows one crafted email can plant a dormant payload in an agent's long-term memory that wakes only when you later discuss finance or health, then exfiltrates it — up to 100% success.

Stop scoring jailbreak defenses on attack success rate alone

A May 2026 IEEE S&P paper argues that attack success rate — the field's default metric — hides how jailbreak defenses actually behave. Its Security Cube evaluates them across several axes at once.

LASM: a 7-layer map of where agent attacks outrun their defenses

A 58-page survey revised May 6, 2026 re-organizes agentic AI security by stack layer and timescale across 116 papers. The map shows where attacks are documented but defenses and benchmarks simply do not exist yet.

MCP sampling: how malicious servers abuse the reverse LLM channel

MCP's sampling feature lets a server ask the client's model for completions. Unit 42 showed (Dec 2025) how a malicious server turns that reverse channel into covert tool calls, conversation hijacking, and compute theft.

IPI Arena: a 272k-attack competition finds no agent model immune

Gray Swan's Indirect Prompt Injection Arena, judged with UK AISI and US CAISI, ran 272,000+ attacks against 13 frontier models. Every model was hijacked — and a single universal template broke nine of them.

TrustFall: project MCP settings turn the folder-trust click into RCE

Adversa AI's TrustFall (May 7, 2026) shows four agentic coding CLIs auto-start project-defined MCP servers the moment a developer accepts the folder-trust prompt — one keypress on the dev machine, zero clicks in CI.

LightLLM CVE-2026-26220: pickle on a WebSocket the server forces onto the network

CVE-2026-26220 (disclosed Feb 15, 2026) puts pickle.loads() on two unauthenticated WebSocket endpoints in LightLLM's prefill-decode mode — and the server refuses to bind to localhost, so the surface is always remote.

Dynamic separators: hardening Polymorphic Prompt Assembling against injection

A May 28, 2026 arXiv paper fixes a blast-radius flaw in Polymorphic Prompt Assembling by generating a unique SHA-256 separator per request, cutting one payload's attack success rate from 0.88 to 0.38.

Silent Egress: implicit prompt injection leaks data through URL previews

An eBay study (arXiv, Feb 25, 2026) shows agents that auto-preview URLs can be made to exfiltrate runtime context through tool calls — P(egress)≈0.89, and 95% of leaks leave the visible answer benign.

OFFENSIVE AI CRITICAL NEW

Agent at the wheel: detecting LLM-driven post-exploitation

On May 10, 2026, Sysdig captured its first intrusion where an LLM agent drove the post-exploitation in real time — CVE-2026-39987 on marimo to a full PostgreSQL dump in under an hour. The forensic tell is the command shape.

Flowise CVE-2026-40933: importing a shared chatflow is enough for RCE

Obsidian Security's May 28, 2026 write-up shows how Flowise's Custom MCP node turns a stdio MCP config into server-side code execution — and how merely importing a shared chatflow can trigger it, no save or run required.

Prompt injection in the wild: hidden attacks in LLM resume screening

A USENIX Security 2026 study of 196,682 real resumes found about 1% carry hidden prompt injections — and over 90% are invisible 'data injections', not the explicit instructions current detectors look for.

RED TEAM MEDIUM NEW

Agentic red teaming: when one operator runs 674 attacks in three hours

A May 2026 paper from Dreadnode wraps the AI red-team toolkit in an agent that picks attacks, runs them, and scores results autonomously — compressing weeks into hours. The real story is what that does to your assessment program.

CrewAI: a silent sandbox fallback turns prompt injection into RCE (VU#221883)

Four CrewAI flaws let prompt injection chain into RCE, SSRF and file read via a Code Interpreter that silently drops out of Docker. CERT/CC's May 20, 2026 update confirms the full fix.

The guardrail trade-off triangle: prompt-injection defenses for LLM tutors

A May 2026 benchmark of prompt-injection defenses for educational LLM tutors puts numbers on a hard truth: no single guardrail wins robustness, usability and latency at the same time.

Jailbreaks leave a trace: detecting attacks in LLM internal activations

A February 2026 paper and a March 2026 follow-up show jailbreak prompts carve a distinguishable signature into a model's hidden activations — enabling inference-time detection without fine-tuning or an auxiliary judge model.

Token-drain attacks: economic denial-of-service via agent tool chains

Two 2026 papers show a malicious tool or skill can steer an LLM agent into long tool-calling loops that multiply token cost 6–658× while still returning the right answer — a stealthy take on OWASP's Unbounded Consumption.

Causal attribution: an emerging defense against indirect prompt injection

A cluster of early-2026 papers — CausalArmor and AttriGuard — defends tool-calling agents by asking which actions are causally driven by untrusted content rather than by the user. A look at the causal-attribution line of defense.

LITMUS: when an agent says no but the file is already deleted

A May 11, 2026 benchmark measures behavioral jailbreaks of LLM agents in real OS environments — and finds that even Claude Sonnet 4.6 executes 40.6% of high-risk operations, sometimes while verbally refusing them.

SIDE CHANNEL MEDIUM NEW

Prompt theft by timing: prefix-cache side channels in multi-tenant LLMs

Shared prefix caching makes LLM APIs faster — and leaks prompts. By timing the first token, an attacker can rebuild another tenant's prompt. A March 2026 paper defends it without killing performance.

AgentSecBench: in an LLM agent, data flow is not authority

Posted May 25, 2026, AgentSecBench formalizes agent security as noninterference and tests six defense classes. The finding: prompt text only describes a boundary, while provenance, capability limits, and output validation enforce one.

AI-authored zero-days: how GTIG fingerprinted the first AI-built exploit

On May 11, 2026, Google's GTIG disclosed the first zero-day it believes was AI-built — a 2FA-bypass script betrayed by a hallucinated CVSS score and textbook docstrings. Here's how to read the tells.

SymJack: one approved file copy becomes RCE in six AI coding agents

Adversa AI disclosed on May 26, 2026 a symlink-hijack pattern that turns a single benign-looking shell copy into a config overwrite and host RCE across Claude Code, Cursor, Gemini, Antigravity, Copilot, Grok Build and Codex CLIs.

2026-05-30//6 min

Slopsquatting in 2026: 127 package names that all five frontier LLMs hallucinate

A May 16, 2026 arXiv replication of the USENIX Security '25 slopsquatting study finds hallucination rates are down across frontier models — but identifies 127 phantom packages that every tested model invents identically, a model-agnostic supply-chain attack surface.

Blindfold: action-level jailbreaks bypass semantic defenses on embodied LLMs

A SenSys '26 paper (May 11–14, 2026) introduces Blindfold, an automated framework that jailbreaks embodied LLMs by decomposing harmful goals into individually benign actions — up to 53% higher attack success than semantic-level baselines on a real 6DoF robotic arm.

MCPwn (CVE-2026-33032): nginx-ui MCP endpoint hands over the web server

An unauthenticated MCP endpoint in nginx-ui ≤ 2.3.3 lets any network attacker rewrite nginx configs and restart the service. CVSS 9.8, publicly disclosed on April 15, 2026, exploited in the wild within hours of the patch.

Measuring LLM exploit capability: ExploitBench, ExploitGym and the SCONE-bench refresh

On May 22, 2026 Anthropic published Mythos Preview results on three new exploitation benchmarks. The numbers — and the way the benchmarks decompose the exploit chain — change how defenders should think about frontier offensive capability.

Proprietary Problems: Cisco's 15-model paired-regime study shows single-turn safety scores miss most multi-turn risk

A May 27, 2026 Cisco study of 15 flagship closed models from OpenAI, Anthropic, Google, Amazon and xAI records multi-turn attack success rates of 7.89% to 88.30% — and cross-regime gaps up to 55 percentage points over single-turn baselines.

One million exposed AI services: what the Intruder scan actually found

On May 5, 2026, Intruder published the results of an internet-wide scan that mapped 1 million exposed AI services across 2 million hosts. The recurring failure is not exotic — it is permissive defaults.

The agent-human security gap: what production ships, what papers study

A May 23, 2026 UCLA paper audits 59 academic studies, 21 production agent systems and 26 security plugins — and finds that the defenses researchers favor have zero production deployment.

The Autonomy Tax: how defense training breaks LLM agents

A March 19, 2026 USC paper measures the cost of prompt-injection-defense training on agent competence — defended models time out on 99% of tasks, vs 13% for undefended baselines.

MCP needs a trust handshake: attested tool-server admission

A May 22, 2026 arXiv paper proposes mcp-attested — a backward-compatible MCP extension that gates tool dispatch on signed clearance, deny-by-default allowlists, and tamper-evident audit logs.

WARD: a co-evolved guard model that holds up against adaptive prompt injection on web agents

A May 14, 2026 NUS paper proposes WARD — a guard model trained against a memory-driven adversarial attacker — and reports near-perfect out-of-distribution recall on web-agent prompt injection.

MemMorph: hijacking tool selection in LLM agents through fluent memory poisoning

A May 24, 2026 arXiv paper from NTU Singapore shows three plausible-looking memory entries can steer an agent toward an attacker-chosen tool with 85.9% success — and survive three off-the-shelf defenses.

SilentRetrieval: fluent RAG corpus poisoning that slips past perplexity filters

A May 27, 2026 arXiv preprint introduces a two-stage attack that hides goal-hijacking triggers inside fluent documents, reaching 57% LLM-attack success on Natural Questions and MS MARCO with one poisoned record per query.

GOVERNANCE MEDIUM

CISA + Five Eyes publish the first joint guidance on agentic-AI adoption

On May 1, 2026, CISA, NSA and the Five Eyes cyber agencies released 'Careful Adoption of Agentic AI Services' — a 5-risk taxonomy and a deployment playbook that critical-infrastructure operators are now expected to fold into their existing cybersecurity frameworks.

Microsoft Copilot Cowork: poisoned skills exfiltrate M365 files with no approval

PromptArmor's May 26, 2026 disclosure shows that a five-line prompt injection inside a Copilot Cowork skill file can leak SharePoint and OneDrive documents through auto-approved Teams messages — no patch closes the design.

MULTIMODAL MEDIUM

CrossMPI: image-only prompt injection steers what VLMs read and see

A May 15, 2026 Xidian University arXiv paper introduces CrossMPI: imperceptible image perturbations that change how vision-language models interpret both the image and the user's text prompt, with 66% average success across five LVLMs.

IterInject: when an LLM optimiser writes its own indirect prompt injections

A May 23, 2026 paper closes the loop between payload, diagnoser and LLM optimiser — lifting indirect-injection ASR from near-zero to 33–90% on InjecAgent and compromising 5 of 9 Claude Code targets.

NSA AISC publishes MCP security design guidance for production AI

On May 20, 2026, NSA's Artificial Intelligence Security Center released a 15-page Cybersecurity Information Sheet on Model Context Protocol — eight classes of weakness, five real-world incidents, nine defensive recommendations.

2026-05-28//8 min

Poisoning the Watchtower: when SOC copilots read attacker-controlled logs

A May 23, 2026 paper formalises log-substrate prompt injection — adversarial content in log fields steering LLM-based SOC assistants. Best defense leaves 11.8% average injection success.

SUPPLY CHAIN MEDIUM

pgAdmin 4 ships an LLM panel and a classic LFI+SSRF arrives with it (CVE-2026-7817)

pgAdmin 4 9.15 patches an authenticated LFI and SSRF in its new LLM API configuration endpoints. The bug class is decades old; the surface is brand new.

Temporal memory contamination: longitudinal safety drift in memory-equipped LLM agents

Three arXiv papers from April and May 2026 converge on a failure mode complementary to memory poisoning — memory-equipped agents drift unsafe as benign context accumulates, with compressed summaries acting as a laundering channel.

The pressure: open-source security teams under the AI-assisted vulnerability flood

On May 26, 2026, curl's Daniel Stenberg published 'The pressure' — more than one credible security report per day, twelve confirmed CVEs in half a release cycle, and a pattern other maintainers are now reporting in parallel.

The agent harness is your real privilege boundary — and most teams draw it in the wrong place

A May 26, 2026 Pillar Security write-up argues the harness — Claude Code, Cursor, Codex — holds the secrets, tools and hooks an agent never sees. Recent harness bugs and CVE-2026-22708 make the case concrete.

Sockpuppeting: a one-line prefill that jailbreaks 11 production LLMs

A line of code injected as the last assistant message coaxes 7 of 10 major models into harmful completions. The fix is not at the model — it is API-side message-order validation.

GrafanaGhost: indirect prompt injection chained with a URL-parse bug to exfiltrate dashboard data

Noma Security's April 7, 2026 disclosure shows how three modest defects — a stored injection point, a startsWith('/') URL check, and a one-word guardrail bypass — combine into a silent exfiltration path through Grafana's AI assistant.

Networks of agents break in new ways: Microsoft's red-team, plus RAMPART and Clarity

Microsoft Research red-teamed an internal platform of 100+ always-on agents. Four attack patterns — propagation, amplification, trust capture, proxy chains — show up only at the network level. RAMPART and Clarity, open-sourced May 20, 2026, are the response.

2026-05-27//8 min

Antigravity find_by_name: when a native tool call jumps over Secure Mode

On April 20, 2026, Pillar Security disclosed that a single unsanitised parameter in Google Antigravity's find_by_name tool turned file search into arbitrary code execution — and bypassed the IDE's strictest sandbox.

OFFENSIVE AI MEDIUM

Apple's May 2026 bulletin formally credits Claude on two macOS CVEs

On May 11, 2026, Apple's macOS Tahoe 26.5 advisory named Claude alongside its researchers on two CVEs — a kernel integer overflow and a WebKit use-after-free. AI-assisted vulnerability research is now in the official changelog.

2026-05-27//6 min

INFRASTRUCTURE CRITICAL

BadHost (CVE-2026-48710): one Host-header character bypasses auth in Starlette, vLLM and FastMCP

X41 D-Sec disclosed on May 22, 2026 a critical auth bypass in Starlette < 1.0.1. A single / ? or # in the HTTP Host header desynchronises the routed path from the path the middleware sees, breaking path-based authorization in vLLM, LiteLLM, FastMCP and thousands of FastAPI-based AI agents.

DATA LEAK CRITICAL

Bleeding Llama: a GGUF parsing flaw leaks Ollama process memory to unauthenticated attackers

CVE-2026-7482, publicly disclosed in May 2026 and codenamed Bleeding Llama by Cyera, lets a remote attacker pull arbitrary chunks of an Ollama server's heap — API keys, system prompts, other users' conversations — with three unauthenticated API calls. The silent patch shipped 2.5 months before the CVE was assigned.

ClaudeBleed: when a browser agent trusts the wrong extension

LayerX disclosed ClaudeBleed on May 6, 2026: a trust-boundary flaw let any Chrome extension drive Claude in Chrome and exfiltrate Gmail, Drive and GitHub data. The first patch was bypassed within hours.

PROMPT INJECTION CRITICAL

Encoded prompt injection: when guardrails fail because the LLM decodes the payload

On May 4, 2026 a tweet written in Morse code drained around $175K from a Grok-controlled crypto wallet. The incident is the most expensive demonstration to date of an old defensive blind spot — string-matching guardrails can't see through encodings that the model itself happily decodes.

OFFENSIVE AI MEDIUM

The first CVE wave: AI-assisted discovery is reshaping disclosure volumes

VulnCheck's May 14, 2026 analysis shows year-to-date CVE issuance up +563% on Chrome, +476% on GitHub, +180% on VMware, +170% on Apache. The systemic shift behind the Apple, Mozilla and ActiveMQ headlines is now visible in the numbers.

PROMPT INJECTION MEDIUM

Font-mapping prompt injection: when peer review becomes an LLM attack surface

A May 25, 2026 arXiv benchmark shows hidden font-mapping payloads can flip LLM peer reviews from reject to accept. ICML 2026 already used the same trick in reverse to desk-reject 497 papers.

MCP STDIO transport: the design choice that became 11 CVEs and 200,000 exposed agents

On April 16, 2026, OX Security disclosed that Anthropic's MCP STDIO transport executes any OS command it is handed. Anthropic called it 'by design'. The cascade has produced eleven downstream CVEs in six weeks.

MultiBreak: 10,389 multi-turn prompts expose how conversational jailbreaks slip past LLM safety

A May 3, 2026 ICML paper releases the largest, most diverse multi-turn jailbreak benchmark to date. It records attack-success-rate gaps of up to 54 points over the previous state of the art on DeepSeek-R1-7B and 34.6 on GPT-4.1-mini — and quantifies how alignment that holds in single turns collapses across follow-ups.

When prompts become shells: prompt injection escalates to RCE in agent frameworks

Two CVEs in Microsoft Semantic Kernel and four in CrewAI — all disclosed in early 2026 — turn a single injected prompt into remote code execution on the host. The pattern is structural, not incidental.

RESEARCH LOW

Teaching Claude Why: how Anthropic drove agentic misalignment to zero

On May 8, 2026, Anthropic's Alignment Science team published a case study showing that teaching Claude to explain its ethical reasoning — not just demonstrate it — cut agentic misalignment from 96% to under 1%.

Poison once, exploit forever: persistent memory poisoning of LLM agents (OWASP ASI06)

An April 2026 arXiv paper on cross-site memory poisoning and a May 13, 2026 OWASP post on the Cisco MemoryTrap finding against Claude Code converge on the same lesson: agent memory is a trust boundary.

Treating AI agents like operating systems: a CISPA blueprint for isolation and privilege

A May 14, 2026 CISPA paper applies decades of OS security thinking to LLM agents. Tested on four OpenClaw-like systems, two weakness classes — cross-user exfiltration and unauthorized network egress — fail in every single one.

OFFENSIVE AI CRITICAL

AI-assisted ICS attack: lessons from the Monterrey water utility intrusion

Dragos' May 2026 report on Servicios de Agua y Drenaje de Monterrey documents the first publicly analysed campaign in which a commercial LLM — Claude — was the primary technical operator of an attempted OT intrusion.

MULTIMODAL CRITICAL

AudioHijack: imperceptible audio hijacks voice agents (IEEE S&P 2026)

An April 16, 2026 IEEE S&P paper introduces auditory prompt injection: adversarial reverb hidden in audio drives 13 large audio-language models and commercial voice agents (Mistral AI, Microsoft Azure) into unauthorized actions with 79-96% success.

INDIRECT INJECTION MEDIUM

Discourse AI XSS (CVE-2026-27740): when LLM output is trusted as HTML

A flagged post, an AI moderator, an htmlSafe call. The Discourse AI plugin treated LLM output as trusted markup, turning indirect prompt injection into Staff-side XSS. Published March 19, 2026.

2026-05-26//6 min

The Lethal Trifecta: when an agent reads private data, untrusted content, and can phone home

Simon Willison's framework for the single architectural mistake that turned 2026's wave of AI-agent data exfiltration vulnerabilities into a class, not a coincidence.

MCP Back-End Vulnerabilities: classic flaws resurface across AI database bridges

Akamai's May 12, 2026 research found SQL injection (CVE-2025-66335), missing authentication, and unsanitised inputs across three MCP servers — Apache Doris, Apache Pinot, and Alibaba RDS. The pattern, not the bugs, is the story.

OFFENSIVE AI MEDIUM

OpenAI Daybreak and GPT-5.5-Cyber: a permissive security model behind a verified-identity gate

Between May 7 and 12, 2026, OpenAI launched Daybreak — a cybersecurity platform built on GPT-5.5, Codex Security and a 'cyber-permissive' sibling, GPT-5.5-Cyber. UK AISI's prior evaluation found a universal jailbreak in six hours.

Project Glasswing: 10,000+ critical bugs found by Claude Mythos in a month

Anthropic's May 26, 2026 update on Project Glasswing reports that ~50 partners have used Claude Mythos Preview to find more than 10,000 high/critical-severity vulnerabilities, including 271 latent bugs patched in Firefox 150 — and lays out a controlled-access model for a frontier offensive capability.

Semantic Kernel: when a prompt becomes a shell (CVE-2026-25592, CVE-2026-26030)

Microsoft disclosed two critical vulnerabilities in Semantic Kernel on May 7, 2026 that turn a single injected prompt into host-level code execution. The root cause is architectural: tool registries and eval() treated as features, not security boundaries.

SUPPLY CHAIN MEDIUM

Hidden triggers in SKILL.md: semantic supply-chain attacks on agent skill registries

A May 12, 2026 University of Maryland paper shows that 20-token additions to a SKILL.md file can make an agent discover and select an adversarial skill in 77–86% of trials, and bypass registry-side scans up to 100% of the time.

Trust No Tool: cognitive poisoning of LLM agents through tool feedback

A May 17, 2026 arXiv paper introduces 'cognitive poisoning' — a malicious tool that wins the agent's trust over many benign-looking turns and only weaponises the final action. The defence target shifts from prompts to trajectory.

ADVERSARIAL MEDIUM

Usability as a Weapon: how feature requests turn coding LLMs insecure

A May 11, 2026 arXiv paper shows that asking a coding LLM for a faster, simpler or feature-richer version of secure code reliably drops the security constraints. UPAttack reaches 98.1% on GPT-5.2-chat and Gemini-3.

Agents Rule of Two: Meta's pragmatic answer to unsolved prompt injection

Published Oct 31, 2025 by Meta and re-adopted in Databricks' May 2026 guide, the Agents Rule of Two limits any agent session to two of three risky properties — the most actionable framework while prompt injection remains unsolved.

Azure SRE Agent: a multi-tenant token check that let strangers watch your incidents (CVE-2026-32173)

Disclosed April 20, 2026, an Entra ID app-registration misconfiguration on Azure SRE Agent's /agentHub WebSocket let any tenant connect, listen to every prompt, reasoning step, CLI command and credential — silently.

CVE-2026-35435: Azure AI Foundry's M365 published agents trusted callers they shouldn't have

Disclosed May 7, 2026 (CVSS 8.6), an improper access-control flaw in Azure AI Foundry let unauthorized attackers elevate privilege through M365 published agents. Microsoft reports active exploitation; mitigations are available before a patch.

Claw Chain: four OpenClaw CVEs that turn an AI agent into the attacker's hands

Disclosed May 15, 2026, Cyera Research's Claw Chain chains four patched OpenClaw flaws — sandbox escape, env-var disclosure, MCP loopback EoP, symlink read escape — into full host takeover via the agent itself.

Comment and Control: one prompt injection pattern, three vendors leaking GitHub Actions secrets

Disclosed April 15, 2026, Comment and Control turns ordinary PR titles, issue bodies and HTML comments into credential-exfiltration channels in Claude Code, Gemini CLI and GitHub Copilot Agent.

Contextual integrity: why prompt-injection defenses keep failing

A May 2026 paper by Abdelnabi and Bagdasarian recasts prompt injection through Contextual Integrity and shows that data-instruction separation is a category mistake.

PROMPT INJECTION CRITICAL

Copirate 365: chaining prompt injection, delayed tool invocation and memory hijack in M365 Copilot (CVE-2026-24299)

Johann Rehberger's DEF CON writeup, published May 2026, walks through a five-stage indirect prompt-injection chain that turns one booby-trapped email into a persistent backdoor inside Microsoft 365 Copilot. Patched, but the patterns are generic.

INDIRECT INJECTION MEDIUM

Indirect prompt injection in the wild: three April 2026 studies converge

Google, Forcepoint and CISPA independently measured indirect prompt injection across the open web in April 2026. The picture: 15K+ validated payloads, 32% growth, organized templates.

INFRASTRUCTURE CRITICAL

LiteLLM CVE-2026-42208: a pre-auth SQL injection in the AI gateway

Disclosed April 20, 2026 and exploited 36 hours after the global advisory dropped, CVE-2026-42208 turns LiteLLM's Authorization header into a direct read on every provider key the proxy fronts.

JAILBREAK MEDIUM

Mathematical encoding jailbreaks: when set theory bypasses LLM safety

An arXiv paper posted on May 5, 2026 shows that re-expressing a harmful prompt as a set-theory or formal-logic problem bypasses safety training on 46–56% of attempts across eight frontier models — but only when a helper LLM does the reformulation, not when mathematical syntax is bolted on top.

When the attacker is another LLM: large reasoning models as autonomous jailbreakers

A Nature Communications paper formalised in May 2026 shows four reasoning models — DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini and Qwen3 235B — jailbreaking nine target LLMs with a 97.14% overall success rate, armed with nothing but a single system prompt.

PraisonAI CVE-2026-44338: an unauthenticated agent server, exploited in 3h44

Disclosed May 11, 2026, CVE-2026-44338 ships PraisonAI with authentication hard-disabled in its legacy API server. A CVE-Detector scanner hit the endpoint less than four hours later.

INDIRECT INJECTION MEDIUM

ShareLeak (CVE-2026-21520): the first CVE Microsoft assigned to a Copilot prompt injection

Disclosed April 15, 2026, Capsule Security's ShareLeak write-up details an indirect prompt injection in Microsoft Copilot Studio. Microsoft assigned CVE-2026-21520 (CVSS 7.5) — an unusual industry first that reframes prompt injection as a tracked vulnerability class.

ARGUS: a provenance-graph defense for context-aware prompt injection

Published May 5, 2026, the ARGUS paper introduces influence-provenance auditing for LLM agents — dropping attack success from 28.8% to 3.8% on a new context-aware injection benchmark.

The Instruction Hierarchy: training LLMs to rank privileged instructions

OpenAI's 2024 paper proposes a structural defense against prompt injection: teach models that system > user > tool output. The idea is now central to GPT-4o-mini and o-series safety training.

INFRASTRUCTURE CRITICAL

LMDeploy SSRF: when an image loader turns into an AI-infrastructure hijack

CVE-2026-33626 turned LMDeploy's load_image() into a generic SSRF primitive. Honeypots saw the first weaponised exploit 12 hours and 31 minutes after the advisory went live.

2026-05-22//6 min

Localhost agent hijack: cross-origin WebSocket attacks on AI coding agents

CVE-2026-44211 (CVSS 9.7), disclosed May 7, 2026, shows how a single visit to a malicious page can hijack an AI coding agent running on a developer's laptop. The attack class is generic — and architectural.

SUPPLY CHAIN CRITICAL

Mini Shai-Hulud: the supply-chain worm that came for the AI tooling stack

Disclosed May 11–18, 2026, the Mini Shai-Hulud worm trojanised 170+ npm and PyPI packages — including Mistral AI, Guardrails AI and TanStack — and persists inside Claude Code and VS Code.

Output filtering beats model self-defense: 20,000 adaptive attacks, one survivor

Posted April 26 and revised May 12, 2026, a Swept AI / Michigan paper pitted nine prompt-injection defenses against an adaptive attacker. Every model-side defense eventually broke. Application-side output filtering held — zero leaks across 15,000 attacks.

2026-05-22//6 min

Prompts as shells: when prompt injection becomes RCE in agent frameworks

Two CVEs disclosed in Microsoft Semantic Kernel on May 7, 2026 (CVE-2026-25592, CVE-2026-26030) show how a single injected prompt can pivot from text to remote code execution on the agent's host.