AGENTS MEDIUM NEW

Brittle agents: indirect injection survives multi-step tool calls

An April 4, 2026 paper tests 6 defenses against 4 indirect-injection vectors across 9 LLM backbones in multi-step agents — advanced injections bypass nearly all of them, and some surface mitigations backfire.

2026-06-02 // 6 min affects: tool-calling-llm-agents, multi-step-agents

What is this?

On April 4, 2026, researchers posted Your Agent is More Brittle Than You Think: Uncovering Indirect Injection Vulnerabilities in Agentic LLMs on arXiv. The paper is not a new attack. It is a systematic measurement of how badly current indirect prompt injection (IPI) defenses hold up once you stop testing them in a single turn and start testing them where agents actually run: dynamic, multi-step, tool-calling workflows.

The setup is broad. The authors evaluate six defense strategies against four indirect-injection attack vectors across nine LLM backbones, in environments where the agent autonomously retrieves third-party content, that content contains an embedded malicious instruction, and the agent then keeps calling tools. The headline finding is uncomfortable: advanced injections bypass nearly all of the baseline defenses, and some surface-level mitigations are not just ineffective but counterproductive — they make things worse.

How it works

Indirect prompt injection hides attacker instructions inside data the agent is expected to read — a web page, a document, a tool result, an email body. The agent fetches it as part of a normal task, and the buried instruction is treated as if it came from the user. This is the data-versus-instruction confusion at the heart of the lethal trifecta: private data access, exposure to untrusted content, and an exfiltration path, all in one agent.

What this paper adds is the multi-step dimension. Single-turn benchmarks ask “does the model obey the injected line right now?” A real agent does not stop there. It plans, retrieves, calls a tool, reads the result, plans again. The injected instruction has many turns to take effect, and an early misstep compounds across the chain. The authors measure this with a Hijack Ratio — how often the agent’s trajectory is diverted toward the attacker’s goal — and report consistently high ratios across backbones.

Two mechanistic observations matter for defenders.

First, agents could not reliably tell the malicious component apart from legitimate content. The paper reports a near-absence of stable linguistic patterns that distinguish injected instructions from benign data. That is a direct blow to the dominant defense family — prefix tags, role tags, “the following is untrusted data” delimiters — which all assume the model can be steered to recognize a boundary it apparently does not robustly perceive.

Second, some surface mitigations backfired. Adding more warning scaffolding around untrusted content can raise the agent’s attention to the injected block instead of lowering its influence, producing worse outcomes than no mitigation at all. This is consistent with the broader taxonomy work on agent prompt-injection threats (February 2026), which finds that context-dependent agent tasks defeat defenses tuned on context-free benchmarks.

Why it matters

The result is a freshness signal about the state of agent security, not a payload. If you ship a tool-calling agent and your IPI defense was validated on single-turn refusal tests, this paper is telling you that number is optimistic by a wide margin. The gap between “passes the benchmark” and “survives a multi-step run against attacker-controlled content” is where most production agents live.

It also narrows the set of defenses worth investing in. Input-side, prompt-layer mitigations — delimiters, tags, “ignore anything that looks like an instruction” — are the ones that fail here, and occasionally the ones that backfire. The defenses that survive are the ones that act on the agent’s internal state or its actions, not on the surface form of the text.

Defenses

The paper’s own positive result points the way, and it lines up with several other 2026 lines of work.

Detect at the representation layer, not the prompt layer. The authors test Representation Engineering (RepE) as a defense and report that a RepE-based circuit breaker identifies and intercepts unauthorized actions before the agent commits to them, with high detection accuracy across the nine backbones. This is the same family as representation-based jailbreak detection: monitor internal activations for the signature of a hijack rather than trying to sanitize the input string.
Gate the action, not the text. Because agents cannot reliably classify injected instructions linguistically, put the control at the tool-call boundary: least-privilege tool scopes, allowlisted parameters, and explicit human confirmation for destructive or exfiltrating actions. A diverted plan that cannot reach a dangerous tool is a contained failure.
Attribute tool invocations to their cause. AttriGuard (March 2026) defends IPI by causal attribution of tool calls — distinguishing actions that follow from the legitimate task from actions injected by retrieved content. See our coverage of causal attribution as an indirect-injection defense for the broader approach.
Shrink the untrusted surface that reaches the planner. Pass third-party content through structured extraction or summarization with a clean model before the agent reasons over it, keep tool definitions and the system prompt in a separate segment, and avoid dumping large raw blobs into context where an injected instruction can accumulate influence over many steps.
Test adaptively and multi-step. Do not certify an agent on single-turn injection strings. Replay the attack across the full tool-calling trajectory and measure a hijack ratio, not just first-turn refusal. A defense that holds for one turn routinely collapses by step three.

Status

Item	Reference	Date	Notes
Brittleness paper	arXiv 2604.03870	2026-04-04	6 defenses × 4 IPI vectors × 9 backbones, multi-step
Key positive result	RepE circuit breaker	same paper	Intercepts unauthorized actions before commit
Threat taxonomy + AGENTPI	arXiv 2602.10453	2026-02	Context-dependent agent tasks defeat context-free defenses
AttriGuard defense	arXiv 2603.10749	2026-03	Causal attribution of tool invocations
Framing	The lethal trifecta	2025-06	Why agents with data + untrusted input + exfiltration are exposed

The takeaway is not “another IPI paper.” It is that the defenses most teams ship — prompt-layer tags and warnings — are the ones this evaluation breaks, sometimes making the agent more obedient to the attacker. The mitigations that survive watch the agent’s internal state and constrain its actions. Re-baseline your agent against a multi-step, adaptive injection, or treat your single-turn pass rate as fiction.