system: OPERATIONAL
← back to all hacks
RESEARCH MEDIUM NEW

Why LLM agent defenses don't compose: lessons from 247 papers

A June 2026 systematization of 247 papers finds agent defenses are useful building blocks but weakly compositional, and benchmarks still miss long-horizon, stateful risk.

2026-06-18 // 6 min affects: tool-using llm agents, coding agents, browser agents, memory-augmented assistants, multi-agent systems

What is this?

Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation (arXiv 2606.10749, posted June 2026, submitted to ACM TOSEM) is a systematic literature review of 247 papers on LLM agent security published between 1 January 2023 and 27 April 2026. Rather than catalogue attacks one by one, it builds a single, systems-oriented model of the agent and asks four questions: how agent security should be modelled, which threat surfaces dominate, what defenses exist and at what cost, and how security claims are actually evaluated.

Its headline conclusion is sobering and useful at once: the field has produced many credible individual defenses, but they remain weakly compositional — stacking them does not reliably add up to a secure system — and current benchmarks underrepresent long-horizon, stateful, and deployment-sensitive risk. The review is a synthesis of public research, not a new attack, which makes it a clean lens for defenders.

How it works

The authors model an agent as a loop over seven elements: A = ⟨I, P, D, T, M, O, C⟩Input and observations, Planning, Decision/commitment to an action, Tool or environment execution, Memory or persistent state, Outputs and side effects, and Coordination with humans, monitors, or peer agents. Security-relevant behaviour emerges not from any single element but from the flows between them: low-authority content arriving in I can distort planning in P, change the committed decision D, trigger a privileged tool call T, poison state M, or propagate through C to other agents.

This framing brings agent security back to classical systems concepts — trust boundaries, mediation, capability control, provenance, and containment — and explains why text-only “say something unsafe” framing is too narrow. Coding the corpus against this loop, the review reports where the research concentrates: tool-use security (156 papers), runtime defense (88), prompt-injection security (75), multi-agent security (63), and memory safety (32), with planning implicated as a lifecycle stage in 227 papers. The literature itself is growing fast — 3 papers in 2023, 42 in 2024, 121 in 2025, and 81 already collected by late April 2026.

Why it matters

Two structural findings should change how teams reason about agent risk. First, defenses don’t compose cleanly. A prompt-injection filter, an output guardrail, and a tool allow-list each close part of the loop, but the review finds little evidence that combining them yields predictable, end-to-end safety — gaps reappear at the seams between input handling, planning, and execution. Treating “we added three guardrails” as “we are secure” is exactly the assumption the paper warns against.

Second, evaluation lags deployment. Most benchmarks still measure immediate attack success in bounded, single-turn environments, while the risks that hurt in production — memory corruption that survives sessions, privilege misuse, and malicious instructions propagating across multi-agent workflows — are precisely the ones least well measured. Multi-agent settings are still a minority of the corpus (47 of 247 papers, about 19%), even as their share of new work climbs from roughly 10% of 2024 papers toward the low-to-mid 20s in 2025. In other words, the part of the field that matters most for realistic deployments is the part with the least mature evidence base.

Defenses

The review’s prescriptive section is its most actionable output. It argues secure agents require four things working together, not in isolation:

  • Explicit trust boundaries. Tag and treat each information source (system prompt, user turn, tool output, retrieved document, peer-agent message) according to its authority, and design the loop so low-authority content in I cannot silently become an instruction in P or D.
  • Principled privilege control. Constrain what tool execution T can do per task — least privilege, scoped credentials, and human confirmation on consequential actions — so a hijacked decision cannot reach a high-impact capability.
  • Provenance-aware state management. Track where entries in memory M came from and validate them on read, because persistent-state corruption is the emerging risk class the paper flags as under-defended.
  • Realistic, compositional evaluation. Test the whole loop over long horizons, with stateful and multi-agent scenarios, and measure safety and utility, latency, and cost together — not just single-turn attack-success rates.

The practical takeaway: defense-in-depth for agents only works if you reason about the seams between layers, and if your evaluation exercises the same stateful, long-horizon conditions your agents will face in production.

Status

This is peer-track academic research (a systematic review submitted to ACM TOSEM), not a vulnerability in a named product, so there is no patch or CVE attached. Key date: arXiv preprint posted June 2026 (arXiv 2606.10749), covering literature through 27 April 2026. The authors’ own framing is the operative takeaway — agent security is a systems problem, and the open challenge is making defenses and evaluations compose around the full agentic loop rather than around isolated attacks.

Sources