INDIRECT INJECTION MEDIUM NEW

Error-path injection: when tool error messages carry implicit authority

A June 2026 paper (VATS) reports that injecting instructions inside tool error messages roughly triples indirect-injection success on frontier agents, reaching up to 100% compliance, because models treat error output as authoritative.

2026-06-19 // 6 min affects: gemini-3.1-pro, gpt-5.5, glm-5.1, qwen3-coder

What is this?

In June 2026, the paper VATS: Exploiting Implicit Authority in Error-Path Injection via Systematic Mutation (arXiv:2606.07992) described an underexamined corner of agent security: the error-handling loop of tool-calling agents. As the Model Context Protocol (MCP) standardizes how autonomous agents call tools, it also standardizes how those tools report failures, and a failed tool call returns an error message that flows straight back into the model’s context.

The paper’s hypothesis is simple and uncomfortable: a tool error message carries implicit authority. When a model sees Error: ..., it switches into a “fix it” mode of corrective reasoning, and that mode is more compliant than the model’s normal posture. An attacker who controls what an error string says can ride that compliance. This is a variant of indirect prompt injection — the malicious instruction never comes from the user; it arrives disguised as system feedback.
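To make the data flow concrete, here is a minimal sketch of a tool-calling loop in which the raw exception text re-enters the model's context. The message format, `call_tool`, and `read_file` are hypothetical names chosen for illustration. They are not taken from the paper or from any MCP SDK, and the input is deliberately benign.

```python
from typing import Callable

# Hypothetical message format: a list of {"role", "content"} dicts.
Message = dict[str, str]


def call_tool(tool: Callable[[str], str], arg: str, context: list[Message]) -> None:
    """Run a tool and append its result, or its error text, to the context."""
    try:
        result = tool(arg)
        context.append({"role": "tool", "content": result})
    except Exception as exc:
        # The raw exception text is passed to the model verbatim.
        # Whoever can influence this string can write into the context.
        context.append({"role": "tool", "content": f"Error: {exc}"})


def read_file(path: str) -> str:
    """Toy tool that always fails and echoes its argument in the error."""
    raise FileNotFoundError(f"cannot open {path}")


context: list[Message] = []
call_tool(read_file, "report.txt", context)
print(context[-1]["content"])  # Error: cannot open report.txt
```

Nothing in this loop distinguishes text written by the runtime from text echoed out of the tool's argument, which is the gap the paper's hypothesis depends on.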

How it works

VATS — Vulnerability Analysis of Tool Streams — is a mutation-driven test framework. Rather than hand-craft one payload, it systematically evolves adversarial strings across seven structural and linguistic dimensions, searching for the phrasing and placement that maximizes compliance. The most effective vector it found was structural positioning: sandwiching the injected instruction inside the surrounding error context, so the instruction reads as part of the diagnostic the model is trying to act on.

The reason this beats ordinary injection is the agent’s own reasoning. Untrusted document text competes with the user’s task for attention; an error message, by contrast, is the task at that moment — the agent has been trained and prompted to recover from failures, so it treats the error channel as trustworthy procedural input rather than as data to be skeptical of. The distinction matters: the same sentence is more dangerous when labelled as an error than when buried in a retrieved web page. A related dynamic was documented in Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents (arXiv:2604.04035), where denial and feedback paths leak influence the designer never intended.

No payload is reproduced here. The mechanism — attacker-controlled error text is read as authoritative instruction — is the entire lesson, and the structural shape (place the instruction where the model expects a fix) is enough to understand the risk without a working exploit.

Why it matters

The headline number is the gap, not the absolute. Across four frontier models — Gemini 3.1 Pro, GPT-5.5, GLM-5.1, and Qwen3-Coder — error-path injection roughly tripled the success rate of standard indirect prompt injection, reaching up to 100% compliance in controlled evaluation. Every model tested was susceptible, which points at a class-level weakness in how agents handle failure rather than a quirk of one vendor.

Two consequences follow for anyone running tool-using agents. First, your input-sanitization boundary is probably incomplete. Teams that filter retrieved documents and user input often pass tool output, especially error output, straight through, on the assumption that the runtime, not an attacker, wrote it. Any tool that reflects attacker-controllable content into an error string (a search query echoed back, a filename, an HTTP body, a parser message) becomes an injection channel. Second, this stacks with the lethal trifecta: an agent with private data access, exposure to untrusted content, and an exfiltration path now has one more high-yield way to be steered.
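One way to find reflecting tools is to check whether caller-supplied values appear verbatim in the error text a tool produces. The sketch below is an audit helper under stated assumptions: `reflected_inputs`, the toy `search` tool, and the `min_len` threshold are all hypothetical, and a substring match will miss tools that transform their input before echoing it.

```python
def reflected_inputs(error_text: str, inputs: dict[str, str], min_len: int = 4) -> list[str]:
    """Return the names of caller-supplied inputs that appear verbatim in error_text."""
    return [
        name
        for name, value in inputs.items()
        # Skip very short values to avoid coincidental matches.
        if len(value) >= min_len and value in error_text
    ]


def search(query: str) -> str:
    """Toy tool that echoes the query into its error message."""
    raise LookupError(f"no results for query: {query}")


inputs = {"query": "quarterly revenue 2031", "user_id": "u-17"}
try:
    search(inputs["query"])
except LookupError as exc:
    print(reflected_inputs(str(exc), inputs))  # ['query']
```

A non-empty result marks a tool whose error path can carry caller-controlled text into the context, so its errors are candidates for templating rather than pass-through.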

Defenses

The error path must be treated as untrusted data, not privileged instruction.

Tag provenance and keep it. Mark tool output — including errors — as a distinct, lower-trust channel and preserve that label through the context, in line with an instruction hierarchy where developer and user intent outrank anything a tool returns.
Sanitize and template error strings. Have the runtime replace raw tool/exception text with a fixed, structured error object (code + safe message). Do not splice attacker-reachable substrings (queries, filenames, response bodies) verbatim into the model-visible error.
Strip imperatives from the error channel. Errors should describe state, never request actions. Detect and neutralize instruction-shaped content arriving via tool output before it re-enters context.
Don’t let an error auto-escalate privilege. A failure should trigger retry/abort/ask-human, not a new, broader tool call. Keep verification before commit on the error-recovery path, not only the success path.
Test the failure modes. Red-team the error loop deliberately. Most agent evaluations inject only through documents and user turns, which leaves the error channel untested, and that is where VATS measured up to 100% compliance.
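Several of these defenses can be combined in a small runtime shim. The following is a minimal sketch, not a reference implementation: the error codes, the `tool_output/untrusted` channel label, and the `ToolError`, `classify`, and `to_model_visible` names are assumptions made for illustration. It replaces raw exception text with a fixed message chosen by the runtime, attaches a provenance label, and maps each failure to a narrow recovery action. It does not attempt to detect instruction-shaped text, because the raw string never reaches the model at all.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class Recovery(Enum):
    RETRY = "retry"
    ABORT = "abort"
    ASK_HUMAN = "ask_human"


# Fixed, runtime-authored messages keyed by error code (hypothetical codes).
SAFE_MESSAGES = {
    "NOT_FOUND": "The requested resource was not found.",
    "TIMEOUT": "The tool did not respond in time.",
    "UNKNOWN": "The tool failed for an unspecified reason.",
}

# Each code maps to a narrow recovery action, never to a new tool call.
RECOVERY = {
    "NOT_FOUND": Recovery.ASK_HUMAN,
    "TIMEOUT": Recovery.RETRY,
    "UNKNOWN": Recovery.ABORT,
}


@dataclass(frozen=True)
class ToolError:
    channel: str   # provenance label that travels with the record
    tool: str      # tool name supplied by the runtime, not by the tool output
    code: str
    message: str
    recovery: str


def classify(exc: Exception) -> str:
    """Map an exception to a code using its type only, never its text."""
    if isinstance(exc, (FileNotFoundError, LookupError)):
        return "NOT_FOUND"
    if isinstance(exc, TimeoutError):
        return "TIMEOUT"
    return "UNKNOWN"


def to_model_visible(tool: str, exc: Exception) -> str:
    """Build the structured error the model sees; the raw text is dropped."""
    code = classify(exc)
    err = ToolError(
        channel="tool_output/untrusted",
        tool=tool,
        code=code,
        message=SAFE_MESSAGES[code],
        recovery=RECOVERY[code].value,
    )
    return json.dumps(asdict(err))


raw = LookupError("no results for query: <attacker-reachable text>")
visible = to_model_visible("search", raw)
print(visible)
```

The raw exception can still be written to an operator-facing log for debugging. The design choice here is only that the model-visible string is assembled entirely from values the runtime controls.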

Status

Item	Detail
Disclosure	arXiv:2606.07992, June 2026
Vector	Indirect injection via tool error messages (MCP error-handling loop)
Models evaluated	Gemini 3.1 Pro, GPT-5.5, GLM-5.1, Qwen3-Coder
Effect	~3× standard IPI success; up to 100% compliance in controlled tests
Most effective technique	Structural positioning (instruction sandwiched in error context)
Precondition	Attacker influence over tool error output