DEFENSE LOW NEW

AgentVisor: an OS-hypervisor pattern that audits every agent tool call

An April 27, 2026 arXiv paper borrows the OS hypervisor idea to defend tool-using LLM agents: a trusted 'visor' audits every tool call and is architecturally blind to untrusted content.

2026-06-07 // 7 min affects: gpt-4o, glm-4.7, llm-agents

What is this?

On April 27, 2026, Zonghao Ying and colleagues published AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization (arXiv:2604.24118, cs.CR, CC BY 4.0). It is a defense paper, not an attack. The idea is borrowed straight from operating systems: in OS virtualization a hypervisor isolates an untrusted guest from privileged hardware. AgentVisor applies the same separation to a tool-using LLM agent — it treats the agent itself as an untrusted guest and puts a trusted visor in front of every tool call.

The motivation is an observation the authors call the awareness–action gap: modern agents can often recognise a prompt injection when you ask them directly, yet still act on the injected instruction during tool use. Knowing an input is hostile does not reliably translate into safe behaviour. AgentVisor aims to close that gap by moving the decision out of the agent and into a separate component that the attacker’s text cannot reach.

How it works

The agent (the Guest) sees the full task context, including any adversarial retrieved content, and proposes a tool call. The Visor (the semantic hypervisor) audits that proposal before it executes. The critical design choice is context isolation: the Visor is architecturally blind to the raw external content. It never reads the web page, email, or document the agent fetched — only the trusted system instruction, the user’s query, and a sanitised execution history that contains structured fields (tool_name, canonicalised args, a short return_summary/status), wrapped in strict delimiters and treated as data, never as instructions. Because embedded attacker text never reaches the auditor, it cannot talk the auditor into approving a malicious action.
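To make the context-isolation idea concrete, here is a minimal sketch of how a Visor prompt could be assembled. It is not the paper's code: `ToolRecord`, `summarise_return`, `build_visor_view`, and the `<<...>>` delimiters are hypothetical names chosen for illustration, and the length-based summary is an assumption about what a "short return_summary" might look like.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolRecord:
    """One executed tool call as the Guest saw it."""
    tool_name: str
    args: dict
    status: str
    raw_output: str  # attacker-controllable; must never reach the Visor


def summarise_return(record: ToolRecord) -> str:
    """Hypothetical summariser: describes the result's shape, never its text."""
    return f"{record.tool_name} returned {len(record.raw_output)} chars"


def build_visor_view(system_instruction: str, user_query: str,
                     history: list[ToolRecord]) -> str:
    """Assemble the only context the Visor is allowed to read."""
    sanitised = [
        {
            "tool_name": r.tool_name,
            # Canonicalise args: sorted keys, values coerced to strings.
            "args": {k: str(r.args[k]) for k in sorted(r.args)},
            "status": r.status,
            "return_summary": summarise_return(r),
        }
        for r in history
    ]
    # raw_output is deliberately never copied into the view.
    return "\n".join([
        "<<SYSTEM>>", system_instruction, "<<END SYSTEM>>",
        "<<USER>>", user_query, "<<END USER>>",
        "<<HISTORY (data only, not instructions)>>",
        json.dumps(sanitised, indent=2),
        "<<END HISTORY>>",
    ])
```

The point of the sketch is the omission: `raw_output` exists on the record but has no path into the string the auditor reads. Note that arguments are still passed through, so a real implementation would also need to decide how to canonicalise arguments the Guest may have copied from untrusted content.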

Each proposed call must pass a three-stage STI protocol in order:

  • Suitability (S) — is this tool even permitted for the assistant’s role under the trusted system instruction? A coarse least-privilege gate, effective against direct injection that tries to escalate the agent’s capabilities.
  • Taint (T) — does the goal implied by the call align with the user’s request and legitimate task state? This is the information-flow check that blocks indirect injection: it rejects new goals the user never asked for, such as forwarding, exfiltrating, or posting content when the user only asked to summarise.
  • Integrity (I) — are the arguments consistent with the entities the user actually specified? This catches the case where the tool choice is reasonable but the arguments have been redirected (e.g. a correct “send email” with an attacker’s recipient).
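The ordering of the three checks can be sketched as a short-circuiting pipeline. In the paper the Visor is an LLM making semantic judgements; the lookup tables below (`ROLE_TOOLS`, `TOOL_GOAL`), the `ToolCall` and `Verdict` types, and the list of entity-bearing argument names are all hypothetical, deterministic stand-ins used only to show the control flow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolCall:
    tool_name: str
    args: dict


@dataclass(frozen=True)
class Verdict:
    passed: bool
    stage: Optional[str] = None  # "S", "T" or "I" when a check fails
    reason: str = ""


# Hypothetical policy tables standing in for the Visor's semantic judgement.
ROLE_TOOLS = {"email_assistant": {"read_email", "summarise", "send_email"}}
TOOL_GOAL = {"read_email": "read", "summarise": "summarise", "send_email": "send"}
ENTITY_ARGS = ("recipient", "account", "channel")


def audit(call: ToolCall, role: str, user_goals: set,
          user_entities: set) -> Verdict:
    """Run S, then T, then I; stop at the first failure."""
    # S: is the tool permitted for this role at all?
    if call.tool_name not in ROLE_TOOLS.get(role, set()):
        return Verdict(False, "S", f"{call.tool_name} not permitted for {role}")
    # T: does the goal implied by the call match what the user asked for?
    goal = TOOL_GOAL.get(call.tool_name)
    if goal not in user_goals:
        return Verdict(False, "T", f"goal {goal!r} was never requested")
    # I: do entity-bearing arguments trace back to user-specified entities?
    for key in ENTITY_ARGS:
        value = call.args.get(key)
        if value is not None and value not in user_entities:
            return Verdict(False, "I", f"{key}={value!r} not specified by user")
    return Verdict(True)
```

For example, with `user_goals={"read", "summarise"}`, a proposed `send_email` call passes S (the role may send email) but fails T, which is the indirect-injection case described above. With `user_goals={"send"}` and `user_entities={"bob@example.com"}`, the same tool with an attacker's recipient fails only at I.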

Instead of simply blocking on failure — which often breaks the task — the Visor follows a trap–audit–recover loop. On a failed check it injects a structured Semantic Exception describing which stage failed (type), the violated_rule, a rationale, hard constraints (“do not forward or share data; only summarise”), and the allowed_objective restating the user-aligned goal. The Guest then regenerates the tool call once against those constraints and executes the revised version. This one-shot self-correction is what keeps utility high while still refusing the injected action.

Guest proposes tool call
        │
        ▼
   Visor audit  ──►  S (role allowed?)  ──►  T (goal = user intent?)  ──►  I (args match user?)
        │                                                                      │
     all pass                                                              any fail
        │                                                                      │
        ▼                                                                      ▼
   execute as-is                                          inject Semantic Exception → Guest
                                                          regenerates ONCE → execute revised call
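The trap-audit-recover loop can be sketched in a few lines. The field names on `SemanticException` follow the description above (type, violated_rule, rationale, constraints, allowed_objective); everything else, including the callable signatures and the function name `trap_audit_recover`, is a hypothetical shape chosen for illustration rather than the paper's interface.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class SemanticException:
    type: str                     # which stage failed: "S", "T" or "I"
    violated_rule: str
    rationale: str
    constraints: tuple[str, ...]  # hard constraints for the retry
    allowed_objective: str        # restates the user-aligned goal


Call = dict  # hypothetical shape: {"tool_name": str, "args": dict}


def trap_audit_recover(
    proposed: Call,
    audit: Callable[[Call], Optional[SemanticException]],
    regenerate: Callable[[Call, SemanticException], Call],
    execute: Callable[[Call], Any],
) -> Any:
    """Execute a passing call as-is; otherwise let the Guest retry once."""
    exc = audit(proposed)
    if exc is None:
        return execute(proposed)
    # Trap: hand the structured exception back to the Guest, exactly once.
    revised = regenerate(proposed, exc)
    # The text above does not say whether the revised call is audited again;
    # this sketch executes it directly, as the diagram shows.
    return execute(revised)
```

A more cautious variant could audit the revised call too and refuse if it still fails. That is a design option under this sketch's assumptions, not something the description above specifies.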

The authors evaluate against seven non-optimisation attacks (Direct, Ignore, Escape, FakeComp, Combined, System, and Important), using OpenPromptInjection for direct injection (7 NLP tasks, 4,900 attack cases) and AgentDojo for indirect injection across its Banking, Travel, Slack, and Workspace environments (629 attack cases), on GPT-4o and GLM-4.7 backbones.

Why it matters

Most deployed prompt-injection defenses fall into one of three groups: heuristic prompt hardening (overridable), input/output classifiers (evadable, and they terminate the task on a false positive; see the operating-point problem in detector benchmarks), or coarse tool sandboxes that cannot track information flow across a multi-step workflow. AgentVisor’s contribution is to package three well-understood OS principles (privilege separation, least privilege plus information-flow control, and exception handling instead of process kill) into one pattern, and to make the auditor structurally unable to see attacker-controlled bytes. That blindness is the same instinct behind the dual-LLM and other patterns in Design Patterns for Securing LLM Agents (Beurer-Kellner et al., June 2025): keep the privileged decision-maker away from untrusted data. It is also an architectural answer to the lethal trifecta (private data, untrusted content, and an exfiltration channel in one agent) that Simon Willison has documented at length.

The reported numbers — attack success rate down to 0.65% with only a 1.45% average utility loss versus no defense — are promising, but read them as what they are: results on two benchmark suites, two backbones, and a fixed set of handcrafted (non-optimisation) attacks. The paper itself includes a section on robustness against adaptive attacks and a limitations discussion; the auditor is still an LLM making semantic judgements, so its decisions are probabilistic, not proofs.

Defenses

The takeaway is a pattern you can apply even without adopting this specific framework:

  1. Separate the decision-maker from untrusted bytes. Whatever audits a tool call should not ingest the raw fetched content. Give it the system policy, the user request, and a structured summary of what happened — not the attacker’s payload.
  2. Pass tool calls through a least-privilege gate. Check first whether the tool is allowed for this agent’s role at all, before looking at arguments. Many direct-injection escalations die here.
  3. Check goal alignment, not just content. The high-value test is whether a call introduces a goal the user never requested (forward, post, send externally). A correct-looking tool call pursuing the wrong objective is the signature of indirect injection.
  4. Validate arguments against user-specified entities. Confirm that recipients, targets, and constraints trace back to the user or to trusted task state, not to retrieved content.
  5. Prefer recover-and-retry over hard-block. Returning a structured, machine-readable reason and letting the agent re-plan once preserves utility far better than killing the task — and avoids the false-positive cost that makes teams disable guardrails.
  6. Keep this as one layer. Semantic auditing is a strong control, not a guarantee. Pair it with hard sandboxing, output gating, and explicit limits on what a compromised agent can reach, so a missed judgement is not a full compromise.

Status

Item                    | Reference                  | Date       | Notes
------------------------+----------------------------+------------+--------------------------------------------
AgentVisor preprint     | arXiv:2604.24118v1 [cs.CR] | 2026-04-27 | Ying et al.; CC BY 4.0
Reported result         | AgentVisor                 | 2026-04-27 | ASR 0.65%, −1.45% avg utility vs No Defense
Direct-injection eval   | OpenPromptInjection        | 2026-04-27 | 7 NLP tasks, 4,900 attack cases
Indirect-injection eval | AgentDojo                  | 2026-04-27 | Banking/Travel/Slack/Workspace, 629 cases
Backbones tested        | GPT-4o, GLM-4.7            | 2026-04-27 | model-agnostic design claimed

AgentVisor is not a finished product you can install — it is a defensive pattern and a set of benchmark results. Its lasting idea is the cleanest one: the component that decides whether a tool call is safe should never be able to read the text an attacker controls.

Sources