DEFENSE

Cognitive Firewall: a split-compute defense for browser agents

A March 2026 eBay paper layers an on-device sentinel, a cloud planner and a deterministic execution guard to cut the success rate of indirect prompt injection against browser agents from 100% to under 1%.

2026-06-22//6 min

DeepMind's AI Control Roadmap: defense-in-depth for misaligned agents

Google DeepMind's AI Control Roadmap (June 2026) treats internal AI agents as potential insider threats, layering trusted-supervisor monitoring on top of model alignment.

LLM salting: rotating the refusal direction to break jailbreak reuse

SophosAI's 'LLM salting' (CAMLIS 2025) applies a small rotation to a model's refusal direction so that a jailbreak precomputed against the base model no longer transfers to your deployment — the rainbow-table defense, applied to LLMs.

Defensive misdirection: why blocking automated jailbreaks can backfire

A June 2026 paper models the attacker's automated judge and shows that predictable refusals feed the search loop — proposing controlled misdirection instead of plain blocking.

Backdoor unlearning generalizes: removing one trigger can suppress others

A June 2026 paper shows that teaching an LLM to ignore one backdoor trigger can also weaken other, never-targeted backdoors — when their internal activation shifts are close, measured by a new metric called CASD.

Why agent refusals fail: the Cybersecurity Refusal Framework

A new benchmark shows agent safety refusals key off the URL string, not the real target. Two trivial tricks — fake 'rules of engagement' and localhost proxying — flip refusal into compliance on production sites.

2026-06-20//6 min

MCP security: stop asking which attacks exist, ask where defenses must live

An April 2026 arXiv paper maps MCP attacks across six architectural layers and finds defenses are uneven and disproportionately tool-centric — leaving host orchestration, transport and supply-chain layers structurally under-defended.

2026-06-20//7 min

Localizing prompt injection: from detection to forensic excision

Detecting a prompt injection only tells you something is wrong. Two 2026 papers, PromptLocate and WebSentinel, pinpoint exactly which span of context is poisoned so it can be excised and the task recovered.

2026-06-20//6 min

SEAgent: mandatory access control to contain agent privilege escalation

A January 2026 paper reframes agent attacks as privilege escalation — actions exceeding the least privilege a task needs — and proposes SEAgent, a deterministic MAC/ABAC layer that enforces policy over an information-flow graph.

2026-06-20//6 min

Cordon: transactional containment for tool-using LLM agents

A June 16, 2026 arXiv paper proposes 'semantic transactions': a runtime that stages an agent's irreversible tool effects and validates the whole task flow before any commit.

2026-06-19//6 min

AuthGraph: dual-graph alignment to catch agent prompt injection

A May 26, 2026 UCLA paper compares a clean authorization graph against the agent's actual provenance graph, cutting AgentDojo attack success from 40% to 1%.

2026-06-19//6 min

SkillVetBench: an LLM-as-Judge that catches what skill scanners miss

A June 14, 2026 arXiv paper shows code-layer skill scanners miss 89–100% of instruction-layer threats, while an LLM-as-Judge flags all 78 malicious test skills with zero false positives.

2026-06-18//6 min

SafeMCP: look-ahead tool gating against power-seeking in MCP agents

A June 1, 2026 arXiv paper (ACL 2026) proposes SafeMCP, a server-side plugin that uses world-model look-ahead to filter hazardous tool acquisition before an MCP agent over-expands its powers.

2026-06-18//6 min

The lethal trifecta is now the default — defend agents at runtime

The lethal trifecta once flagged risky agents. By mid-2026 it describes every useful one, so architecture-level avoidance no longer works. Defense shifts to five runtime behavioral signals.

2026-06-18//6 min

DoubtProbe: catching jailbreaks that reorganize intent

A June 2026 paper proposes an inference-time defense that treats jailbreak detection as a consistency check: rebuild the request under structural constraints, then flag the prompts whose meaning won't survive the round-trip.

2026-06-18//5 min

Detecting attacks in agent tool-call traffic: content beats graph

A May 2026 arXiv study of MCP tool-call monitoring finds content embeddings drive detection (AUROC > 0.89), graph structure adds little, and naive random splits inflate scores by up to 26 points.

2026-06-17//6 min

RUBAS: rubric-based RL gives agent safety a fine-grained reward signal

A June 2026 paper replaces coarse refuse/comply rewards with four scored rubrics — tool-use, argument, response and helpfulness — to train tool-calling agents that stay safe without losing utility.

2026-06-17//5 min

SkillGuard: a permission framework that governs what an agent skill can do at runtime

A June 2026 paper closes the gap between what a skill injects into an agent's context and what it makes the agent do, using manifests, deny-by-default access control and runtime monitoring.

2026-06-17//6 min

Dummy backdoors: removing unknown LLM backdoors via shared internal mechanisms

A June 2026 paper removes hidden backdoors you can't see by planting one you can: different backdoors share internal activation patterns, so deleting a controllable 'dummy' weakens the unknown one too.

2026-06-17//6 min

Provenance defenses for agent graph memory are blind by construction

An arXiv paper dated June 10, 2026 shows provenance checks on LLM graph memory can be bypassed without forging a single source: untrusted structure reroutes which authenticated facts get selected, and information-flow control never sees it.

Agent privacy is a trajectory problem: OCELOT budgets inference leakage at runtime

An arXiv paper dated June 10, 2026 reframes LLM-agent privacy as posterior-risk control: not filtering each output, but budgeting how much an adversary's belief about a secret may improve across a whole trajectory.

Verified agent skills: capability governance for the SKILL.md supply chain

NVIDIA's verified agent skills (May 19, 2026) add risk scanning, cryptographic signing and machine-readable skill cards to the SKILL.md supply chain as a defensive answer to poisoned skills.

Parallax: putting agent safety in the architecture, not the prompt

A position paper published April 14, 2026 argues prompt-level guardrails fail the moment an agent's reasoning is compromised, and proposes structurally separating the part that thinks from the part that acts.

2026-06-16//7 min

Architecting secure agents: a plan-and-policy defense against prompt injection

An NVIDIA position paper (March 31, 2026) argues that indirect prompt injection cannot be fixed at the model alone — and proposes a plan-and-policy system architecture that constrains what an agent may observe and decide.

Why prompt-injection detectors keep failing: the evasion problem in 2026

From keyword classifiers to activation-based drift probes, prompt-injection detectors share one weakness: an adaptive attacker. Two studies report up to ~100% evasion. Treat detection as one layer, never the boundary.

Confidential Computing for Agentic AI: what enclaves can't protect

A May 2026 survey maps confidential computing onto the agentic stack — hardware enclaves can shield agent memory and KV caches from a malicious cloud operator, but they cannot stop prompt injection.

Why jailbreaks transfer between models — and how salting fights back

A study of 20 open-weight models finds jailbreak transfer comes from shared internal representations, not safety-training quirks. A defense called LLM salting rotates the refusal direction to break reuse.

SafeHarbor: a hierarchical memory guardrail that targets agent over-refusal

Accepted at ICML 2026, SafeHarbor is a training-free guardrail that injects context-aware safety rules from a self-evolving risk tree — keeping 63.6% benign utility on GPT-4o while refusing over 93% of attacks.

Prompt injection is unsolved — so contain it at machine speed

At Infosecurity Europe 2026, OWASP's Ariel Fogel called prompt injection an unresolved architectural problem and argued defenders must shift from prevention to runtime containment that runs as fast as the agent.

SecureClaw: a dual-boundary defense for tool-using LLM agents

A June 2026 paper proposes guarding two distinct boundaries at once — authorizing external actions at the effect sink and confining plaintext at the read boundary — reporting 0% attack success on one agent benchmark.

2026-06-14//6 min

PI-Hunter: auditing agents to expose and localize hidden prompt injections

A June 2026 paper from Google researchers reframes prompt-injection red-teaming as auditing — PI-Hunter evolves source-aware test cases to surface where latent injections enter and propagate through an agent, not just whether an attack lands.

2026-06-13//6 min

Tool stream injection: why static agent defenses break, and what verify-before-commit fixes

A January 2026 paper, VIGIL, reframes indirect injection around the tool stream (forged tool descriptions and fake error messages) and shows that the better aligned an agent is, the more it obeys them.

Inside GitHub Agentic Workflows: a security architecture for CI/CD agents

GitHub Agentic Workflows reached public preview on June 11, 2026 with a security-first design: zero-secret agents in a chroot jail, a workflow firewall, staged-and-vetted writes, and a threat-detection job. It is a defensive answer to prompt injection in CI/CD.

2026-06-12//7 min

TRUSTDESC: deriving tool descriptions from code to defuse tool poisoning

An April 2026 paper attacks tool poisoning at its root: generate a tool's description from its implementation instead of trusting the author-supplied text, neutralising implicit poisoning that detectors miss.

The Recuse Signal: a robots.txt for agents that hold real credentials

A June 2026 paper proposes an in-band 'deny' signal — emitted over an SSH banner or a PostgreSQL NOTICE — that politely asks an autonomous agent to withdraw. In a pilot it induced 100% recusal, but an authorization framing flipped the strongest model right back.

The Defense Trilemma: why prompt-injection wrappers can't be complete

A Lean 4-verified April 2026 proof shows no continuous, utility-preserving input wrapper can block every prompt injection. Continuity, utility, and completeness cannot all hold at once.

2026-06-12//7 min

AgentDyn: why injection defenses that ace static benchmarks fail in the wild

A February 2026 ICML benchmark, AgentDyn, runs ten leading prompt-injection defenses on dynamic, open-ended agent tasks. Almost all are either insecure or over-defend into uselessness.

Oversight has a capacity: when more agent approvals make you less safe

A June 8, 2026 arXiv paper models the human reviewer behind an agent's approval gate as a fatiguing, finite resource — and shows that escalating more actions can lower realized safety and open a flooding attack.

2026-06-11//7 min

CASA: task-based access control that checks tool calls against the user's real intent

A May 4, 2026 arXiv paper proposes Continuous Agent Semantic Authorization — a zero-trust layer that extracts a user's task from a multi-turn chat and denies tool calls that don't match it.

2026-06-11//6 min

ADR: detection and response for MCP agents, proven at Uber scale

A May 2026 paper from Uber describes a production EDR-style system for MCP agents: full causal telemetry, two-tier detection, and offline red-teaming, running on 7,200+ hosts for ten months.

ePCA: replacing semantic agent guardrails with formal verification

A May 2026 paper proposes ePCA, a guardrail that compiles each agent action into first-order logic and runs an SMT check before execution, blocking unsafe steps as logical deadlocks.

AgentTrust: vetting agent tool calls before they execute

A preprint from May 6, 2026 introduces AgentTrust, a runtime layer that vets each agent tool call before it runs and returns allow/warn/block/review — catching obfuscated shell payloads static guards miss.

Catching model extraction by watching the whole traffic window, not single queries

A June 2026 paper shows a simple distribution test (MMD over query embeddings, calibrated on benign traffic only) detects LLM model-extraction campaigns hidden in mixed API traffic, with 0.3% false positives and 100% detection on pure-attacker streams.

Agent Security Is a Systems Problem: Treat the Model as Untrusted

A May 2026 position paper from Google, UCSD and UW–Madison argues agent security must move out of the model and into the system: treat the LLM as an untrusted component and enforce invariants around it.

2026-06-08//8 min

Need to Know: contextual-integrity query rewriting for LLM delegation

A June 2, 2026 arXiv paper recasts privacy-preserving query rewriting as a contextual-integrity problem: forward a span to a cloud LLM only if the task needs it, not because a PII type matched.

Membrane: contrastive safety memory that adapts guardrails without retraining

A June 4, 2026 arXiv paper proposes Membrane, a self-evolving guardrail that pairs each blocked attack with a near-identical benign request, cutting over-refusal to 7–14% while topping F1 on six jailbreaks.

OpenAI Lockdown Mode: cutting the exfiltration leg of prompt injection

On June 6, 2026 OpenAI extended Lockdown Mode to personal and self-serve Business ChatGPT accounts: a deterministic setting that disables outbound paths attackers use to exfiltrate data via prompt injection.

THRD: a training-free temporal defense against multi-turn jailbreaks

A June 2026 paper argues multi-turn jailbreaks must be judged across the whole conversation, not turn by turn. THRD scores accumulated risk over time and cuts attack success to 0.2–4% without retraining.

Two methodology traps that inflate prompt-injection detector scores

A June 1, 2026 arXiv preprint shows most prompt-injection and jailbreak detector benchmarks lean on per-dataset threshold tuning and undisclosed operating points — two habits that quietly inflate the accuracy you buy.

AgentVisor: an OS-hypervisor pattern that audits every agent tool call

An April 27, 2026 arXiv paper borrows the OS hypervisor idea to defend tool-using LLM agents: a trusted 'visor' audits every tool call and is architecturally blind to untrusted content.

2026-06-07//7 min

Microsoft's agentic failure-mode taxonomy v2.0: zero-click human-in-the-loop bypass

Microsoft's AI Red Team v2.0 taxonomy (June 4, 2026) adds seven agentic failure modes and reports human-in-the-loop bypass as the most consistently exploited — including zero-click chains from a single external input.

2026-06-07//7 min

The agent that writes its own logs: why self-reported agent audit trails can't be trusted

If a compromised agent produces its own activity log, it can omit, alter, or fabricate what it did. Three June 2026 efforts — arXiv's Notarized Agents, an IETF agent-audit-trail draft, and SCITT — converge on the same fix: move the trust boundary off the agent.

2026-06-05//6 min

When embedding-based defenses fail in LLM multi-agent systems

A May 1, 2026 arXiv paper shows that detectors which prune malicious agents by message embedding collapse when attackers craft near-benign text — and proposes token-confidence signals as a more robust replacement.

2026-06-05//6 min

PISmith: adaptive RL red-teaming keeps breaking injection defenses

A March 2026 paper trains an attacker model with reinforcement learning to stress-test prompt-injection defenses in a black-box setting — and 8 state-of-the-art defenses still fall, including on AgentDojo and InjecAgent.

Hybrid BM25 + vector retrieval cut gradient-guided RAG poisoning from 38% to 0%

A March 10, 2026 arXiv preprint shows that adding sparse BM25 alongside dense retrieval blocks an entire class of gradient-optimized RAG corpus poisoning — without touching the LLM.

AgentShield: catching compromised agents with honeytokens and decoy tools

A May 2026 paper turns deception engineering on tool-using LLM agents: fake tools, fake credentials, and parameter allowlists that a hijacked agent trips over. It reports 90.7–100% detection of successful attacks with zero false alarms.

OWASP Agent Memory Guard: a runtime layer against agent memory poisoning

Covered by Help Net Security on June 1, 2026, OWASP's Agent Memory Guard is the first reference implementation for ASI06 — a drop-in layer that screens every agent memory read and write against a YAML policy.

Catching credential exfiltration in LLM agents before the output token

Published June 2, 2026, an arXiv paper detects agent credential leaks before any output token is emitted — combining activation probes, calibrated honeytokens, and multi-turn leakage accounting.

2026-06-04//7 min

Agent Threat Rules: a "Sigma for AI agents" — and what its recall numbers admit

ATR ships open YAML detection rules for agent attacks, now running at Microsoft, Cisco and Gen Digital. Its own benchmarks show why regex detection is a layer, not a perimeter.

2026-06-03//6 min

SnapGuard: catching prompt injection in what the agent sees, not what it parses

An April 2026 paper proposes a lightweight detector for screenshot-based web agents, where text-centric guards are blind. It reads the rendered pixels (gradient stability plus polarity-reversed text) at 1.81 s per page.

2026-06-03//6 min

DataShield: when benign fine-tuning quietly erodes a model's safety

A May 29, 2026 arXiv paper shows fine-tuning an aligned LLM on harmless data still degrades its safety, and proposes DataShield to flag the samples responsible before training.

2026-06-03//6 min

Stop scoring jailbreak defenses on attack success rate alone

A May 2026 IEEE S&P paper argues that attack success rate — the field's default metric — hides how jailbreak defenses actually behave. Its Security Cube evaluates them across several axes at once.

2026-06-02//6 min

Dynamic separators: hardening Polymorphic Prompt Assembling against injection

A May 28, 2026 arXiv paper fixes a blast-radius flaw in Polymorphic Prompt Assembling by generating a unique SHA-256 separator per request, cutting one payload's attack success rate from 0.88 to 0.38.

2026-06-02//6 min

The guardrail trade-off triangle: prompt-injection defenses for LLM tutors

A May 2026 benchmark of prompt-injection defenses for educational LLM tutors puts numbers on a hard truth: no single guardrail wins robustness, usability and latency at the same time.

2026-06-01//6 min

Jailbreaks leave a trace: detecting attacks in LLM internal activations

A February 2026 paper and a March 2026 follow-up show jailbreak prompts carve a distinguishable signature into a model's hidden activations — enabling inference-time detection without fine-tuning or an auxiliary judge model.

2026-06-01//6 min

Causal attribution: an emerging defense against indirect prompt injection

A cluster of early-2026 papers — CausalArmor and AttriGuard — defends tool-calling agents by asking which actions are causally driven by untrusted content rather than by the user. A look at the causal-attribution line of defense.

2026-06-01//6 min

One million exposed AI services: what the Intruder scan actually found

On May 5, 2026, Intruder published the results of an internet-wide scan that mapped 1 million exposed AI services across 2 million hosts. The recurring failure is not exotic — it is permissive defaults.

2026-05-29//7 min

MCP needs a trust handshake: attested tool-server admission

A May 22, 2026 arXiv paper proposes mcp-attested — a backward-compatible MCP extension that gates tool dispatch on signed clearance, deny-by-default allowlists, and tamper-evident audit logs.

2026-05-29//6 min

WARD: a co-evolved guard model that holds up against adaptive prompt injection on web agents

A May 14, 2026 NUS paper proposes WARD — a guard model trained against a memory-driven adversarial attacker — and reports near-perfect out-of-distribution recall on web-agent prompt injection.

2026-05-29//7 min

Project Glasswing: 10,000+ critical bugs found by Claude Mythos in a month

Anthropic's May 26, 2026 update on Project Glasswing reports that ~50 partners have used Claude Mythos Preview to find more than 10,000 high/critical-severity vulnerabilities, including 271 latent bugs patched in Firefox 150 — and lays out a controlled-access model for a frontier offensive capability.

2026-05-26//7 min

Agents Rule of Two: Meta's pragmatic answer to unsolved prompt injection

Published October 31, 2025 by Meta and re-adopted in Databricks' May 2026 guide, the Agents Rule of Two limits any agent session to two of three risky properties: the most actionable framework while prompt injection remains unsolved.

2026-05-25//6 min

ARGUS: a provenance-graph defense for context-aware prompt injection

Published May 5, 2026, the ARGUS paper introduces influence-provenance auditing for LLM agents — dropping attack success from 28.8% to 3.8% on a new context-aware injection benchmark.

2026-05-22//7 min

The Instruction Hierarchy: training LLMs to rank privileged instructions

OpenAI's 2024 paper proposes a structural defense against prompt injection: teach models to rank system instructions above user instructions, and user instructions above tool output. The idea is now central to GPT-4o-mini and o-series safety training.

2026-05-22//7 min