RESEARCH MEDIUM NEW

Where agent attacks actually enter: a 247-paper threat-surface map

A June 2026 survey of 247 papers measures where LLM-agent attacks land. User prompts are only one surface among several — mediated channels like web content and tool outputs dominate.

2026-06-18 // 7 min affects: llm-agents, tool-using-agents, web-agents, coding-agents, multi-agent-systems

What is this?

On June 2026, a systematization-of-knowledge paper titled Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation (arXiv:2606.10749) was posted. It synthesizes 247 papers through a lifecycle-based, systems-oriented framework, and it does something most threat lists do not: it counts where the attacks actually enter the agent loop. The result is a measured map of the agent threat surface as of mid-2026, not another taxonomy of attack names.

The headline for defenders is a correction of intuition. The instinct is to treat the user prompt as the dangerous input. The corpus says the user prompt is one surface among several, and the more characteristic risk arrives through mediated channels — web pages the agent browses, outputs returned by the tools it calls, and documents pulled in by retrieval.

This is a snapshot of research concentration, not a definitive ranking of all real-world risk. But research concentration is itself useful: it tells you where the field has decided the structural weaknesses are.

How it works

The survey organizes agent security around three interacting properties: information flow, delegated authority, and persistent state. Rather than asking only what input did the agent see, it asks what is the agent now allowed to do because it saw it. Attacks are then located by where they enter the loop and which transition they exploit.

When the corpus is coded by threat surface, the distribution is concrete (papers can touch more than one surface):

Threat surface            Papers      What it means
──────────────────        ──────      ─────────────────────────────────
User Prompts              82          Direct instructions from the user
Web Content               55          Pages fetched while browsing
Tool Outputs              54          Results returned by an invoked tool/API
Retrieved Content         37          Evidence from RAG / search indices
Files / Code              >=25        Local artifacts read, run, or modified
Planning Loop             >=25        Intermediate reasoning / trajectory
Memory / Scratchpads      >=25        State the agent retains for later
Inter-agent Channels      >=25        Messages passed between agents

User Prompts is the single most frequent surface at 82 papers — but Web Content (55), Tool Outputs (54), and Retrieved Content (37) together describe a much larger, mediated attack surface. These channels carry content that is task-relevant but not authority-bearing: the agent ingests it as evidence, then treats embedded instructions as actionable. That is the core failure — the loss of separation between data and control, and between a low-authority observation and a high-authority instruction.

The attack-family counts confirm the same shape. In the threat-model coding, Prompt Injection appears in 142 papers and Indirect Prompt Injection in 86. Broken down by deployment scenario, web browsing shows Prompt Injection 71 times and Indirect Prompt Injection 44 times; software-engineering agents show 32 and 16. Prompt injection is not one attack among many in this literature — it is the dominant mechanism by which untrusted content becomes unsafe control. This matches what practitioners reported independently in June 2026: prompt injection still drives most agentic failures seen in production.

The survey’s second move is to frame the most dangerous events as transitions, not components. Harm tends to occur when untrusted content is reinterpreted as a planning constraint, when a tentative plan becomes a committed action, or when a stored trace is later reused as trusted context. This is also why memory poisoning and multi-agent contagion are flagged as the emerging frontier — they are delayed and propagating forms of the same control-flow problem.

Why it matters

Three concrete takeaways.

The field is young and preprint-heavy, so calibrate confidence. The corpus grows from 3 papers in 2023 to 42 in 2024 to 121 in 2025, with 81 more by April 27, 2026 (32.79% of the total). arXiv accounts for 169 papers (68.42%). Terminology, threat models, and evaluation protocols are still moving. Treat individual claims as versioned observations, not settled results — which is exactly the discipline the survey itself recommends.

Single-agent evidence dominates, but multi-agent risk is rising. Single-agent systems make up 200 papers (80.97%); multi-agent systems 47 (19.03%). The multi-agent share climbs from 9.52% in 2024 to 23.97% in 2025. If your roadmap involves agents that delegate to or message other agents, you are moving into the part of the threat surface the current evidence base covers least — inter-agent channels, coordination failure, and cross-agent propagation of malicious instructions.

Defenses do not compose, and benchmarks miss the hard cases. The survey finds that current defenses are useful building blocks but weakly compositional, and that existing benchmarks underrepresent long-horizon, stateful, and deployment-sensitive risks. In practice that means a clean score on a single-turn injection benchmark tells you little about whether a stateful, tool-using, multi-step agent will hold up.

Defenses

The survey’s prescriptions translate directly into an architecture checklist. None of these are novel exploits to fear; they are boundaries to build.

Treat mediated channels as untrusted by default. Web content, tool outputs, and retrieved documents carry data, not instructions. Strip or quarantine instruction-like content from these surfaces before it reaches the planning context, and never let retrieved or browsed text silently re-enter the loop as a directive.
Enforce an explicit instruction hierarchy and source legitimacy. The structural failure is the agent treating low-authority observations as high-authority commands. Tag every span by provenance (user vs. tool vs. web vs. memory) and make the model’s policy condition on that tag, so source — not just content — governs what is actionable.
Put privilege control at the action boundary. Because the agent acts under delegated authority the attacker does not own, the durable check is at tool execution: per-action capability checks, least privilege per tool, and human confirmation for high-consequence actions. This constrains the dangerous plan-to-action transition rather than trying to perfectly clean every input.
Make persistent state provenance-aware. Memory and scratchpads are a delayed control-flow channel: poisoned content written today can be retrieved as “trusted” context tomorrow. Record where each memory came from, expire or re-validate it, and never auto-promote stored traces to trusted instructions.
Watch the lethal trifecta. The classic high-risk combination — access to private data, exposure to untrusted content, and an external-communication path — remains the configuration to avoid or tightly gate, as Simon Willison framed it. The 247-paper map is, in effect, a detailed account of how that trifecta gets exploited across surfaces.
Evaluate at the deployment shape you actually run. Single-turn injection scores do not predict multi-step, stateful, or multi-agent behavior. Test long-horizon trajectories, memory reuse, and inter-agent propagation explicitly, because that is where the survey says both defenses and benchmarks are currently weakest.

Status

Item	Reference	Date	Notes
Toward Secure LLM Agents (SoK)	arXiv 2606.10749	2026-06	247 papers; lifecycle + systems framework
Corpus growth	same	2023–2026	3 -> 42 -> 121 papers; 81 by 2026-04-27
Threat-surface counts	same	2026-06	User Prompts 82, Web 55, Tool Outputs 54, Retrieved 37
Attack-family counts	same	2026-06	Prompt Injection 142, Indirect 86
Multi-agent share	same	2024-2026	9.52% -> 23.97% -> 17.28% (partial)
Production corroboration	Help Net Security / OWASP	2026-06-11	Prompt injection still drives most agentic failures

The practical lesson is not that agents are unusable. It is that the dangerous input is rarely the user’s prompt. Once a model browses, calls tools, retrieves, remembers, and talks to other agents, every one of those channels is an entry point — and the boundary that matters is not “is this text malicious” but “is the agent allowed to act on it.” Build for the transitions, not just the inputs.