DEFENSE MEDIUM NEW

Agent Security Is a Systems Problem: Treat the Model as Untrusted

A May 2026 position paper from Google, UCSD and UW–Madison argues agent security must move out of the model and into the system: treat the LLM as an untrusted component and enforce invariants around it.

2026-06-08 // 8 min // affects: chatgpt, claude-code, microsoft-copilot, cursor, devin, deepseek, amp

What is this?

On May 18, 2026, a group of researchers from Google, UC San Diego, the University of Wisconsin–Madison, Meta FAIR, Cornell and EmbraceTheRed published a position paper titled Agent Security is a Systems Problem (arXiv:2605.18991, CC BY 4.0). Its thesis is one sentence long: the AI model powering an agent must be treated as an untrusted component, and security invariants must be enforced at the level of the surrounding system, not inside the model.

That framing is a deliberate break from the dominant approach, which treats the model as the primary object of security and tries to make it robust through alignment and training. The authors — including names well known in adversarial ML and systems security — argue that this is the same bet the field already lost once, in the “classical adversarial ML” era of vision models, where model-based defenses were repeatedly evaded. The paper was picked up in Adversa AI’s June 2026 agentic-security roundup as one of the month’s notable reads.

How it works

The paper maps decades of systems-security doctrine onto agents. The standard architecture it leans on has four parts: a Trusted Computing Base (TCB) whose integrity an attacker cannot affect, a Security Policy that declares what is allowed, a Reference Monitor inside the TCB that checks every request against that policy, and an Untrusted System on the other side of a security boundary. Against that backbone the authors restate five principles agents routinely violate:

| Principle | What it requires |
| --- | --- |
| Least Privilege | A component gets only the permissions its task needs, and no more |
| TCB Tamper Resistance | The trusted core cannot be modified by untrusted input |
| Complete Mediation | Every request crossing the boundary is checked; none bypasses the monitor |
| Secure Information Flow | Sensitive data must not leak to untrusted sinks, even via side channels |
| Human Weak Link | Mechanisms must assume users, admins and developers will make mistakes |
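The four-part architecture can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the paper: the names `Request`, `Policy` and `ReferenceMonitor`, and the tool names, are hypothetical. The model sits on the untrusted side and can only propose a `Request`; the monitor decides deterministically and denies anything the policy does not name.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Request:
    """An action the untrusted model asks the system to perform."""
    tool: str    # hypothetical tool name, e.g. "read_file"
    target: str  # path, host, or command the tool would act on


@dataclass(frozen=True)
class Policy:
    """Explicit, deny-by-default policy: tool name -> allowed targets."""
    allowed: dict = field(default_factory=dict)

    def permits(self, req: Request) -> bool:
        # Unknown tools map to an empty set, so they are always denied.
        return req.target in self.allowed.get(req.tool, frozenset())


class ReferenceMonitor:
    """Deterministic checkpoint inside the TCB; records every decision."""

    def __init__(self, policy: Policy):
        self._policy = policy
        self.audit_log = []

    def check(self, req: Request) -> bool:
        decision = self._policy.permits(req)
        self.audit_log.append((req, decision))
        return decision


policy = Policy(allowed={"read_file": frozenset({"README.md"})})
monitor = ReferenceMonitor(policy)

ok = monitor.check(Request("read_file", "README.md"))
denied_file = monitor.check(Request("read_file", ".env"))
denied_tool = monitor.check(Request("shell", "ping example.com"))
```

Complete mediation is a property of the wiring rather than of this class: the sketch only helps if no tool can be invoked without first passing through `check`.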

To show that the framework is descriptive rather than abstract, the authors analyze eleven real, already-public attacks on shipping agents and tag each against the principles it broke. The list reads like a 2024–2026 incident log: Microsoft Copilot exfiltration, Cursor “AgentFlayer,” Claude Code exfiltration, Devin AI exposed ports and secret leaks, ChatGPT long-term-memory “SpAIware,” ChatGPT Operator prompt injection, DeepSeek account takeover, Terminal DiLLMa, Amp arbitrary command execution, and “AI ClickFix.” In Table 1 of the paper, every one of them violates Secure Information Flow, and most violate two or more principles at once. That is the point: these are not eleven unrelated bugs but one missing architecture observed eleven times.

Two of their worked examples are instructive without being actionable. In the ChatGPT memory case, an indirect prompt injection in an untrusted document wrote attacker instructions into the app’s trusted “Memories” store (a TCB tamper-resistance failure), the app could reach arbitrary servers regardless of the user’s request (least privilege), and conversation data then flowed out through a rendered image URL (secure information flow). In the Claude Code case, an injected instruction in a code file had the agent read a .env file and smuggle the secret out through an allow-listed ping, whose DNS lookup carried the data — the agent had broader shell access than the task needed, and sensitive data reached an untrusted resolver.
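The tagging exercise is easy to reproduce for your own incident reviews. The sketch below encodes only the two cases as summarized above; it is not a copy of the paper's Table 1, and the names `Principle`, `INCIDENTS` and `shared_violations` are hypothetical.

```python
from enum import Enum


class Principle(Enum):
    LEAST_PRIVILEGE = "Least Privilege"
    TCB_TAMPER_RESISTANCE = "TCB Tamper Resistance"
    COMPLETE_MEDIATION = "Complete Mediation"
    SECURE_INFORMATION_FLOW = "Secure Information Flow"
    HUMAN_WEAK_LINK = "Human Weak Link"


# Tags follow the two worked examples as described in this article only.
INCIDENTS = {
    "ChatGPT memory": {
        Principle.TCB_TAMPER_RESISTANCE,
        Principle.LEAST_PRIVILEGE,
        Principle.SECURE_INFORMATION_FLOW,
    },
    "Claude Code exfiltration": {
        Principle.LEAST_PRIVILEGE,
        Principle.SECURE_INFORMATION_FLOW,
    },
}


def shared_violations(incidents):
    """Return the principles violated by every incident in the mapping."""
    tag_sets = list(incidents.values())
    return set.intersection(*tag_sets) if tag_sets else set()


common = shared_violations(INCIDENTS)
```

For these two cases `common` contains Least Privilege and Secure Information Flow, which points at the controls worth building first.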

The harder half of the paper explains why this is not trivial to fix. Agents do not map cleanly onto the classic architecture. A traditional app is single-purpose, so a developer can write a fixed policy at install time. An agent takes an open-ended natural-language goal, composes tools at runtime, follows links it discovers, and refines underspecified tasks as it goes, so its policy is fuzzy, dynamic, and expressed in prose. The authors compare this to dynamic-code loading on the web, which the browser tamed with Content Security Policy, the Same-Origin Policy, <iframe> sandboxing and Subresource Integrity. Agents have no equivalents yet: the provenance of an instruction is hard to establish, and sandboxing via mechanisms like Instruction Hierarchy is probabilistic at best. Worse, using a “safety LLM” as the reference monitor reintroduces the problem you were trying to escape: the TCB itself becomes probabilistic, has no formal contract, and can be attacked.

Why it matters

This paper is a frame, not a fix, and that is its value. If you accept that model-based separation of instructions and data “will always be evaded by persistent attackers” — the authors’ explicit conjecture, consistent with the lethal trifecta and contextual-integrity arguments about prompt injection — then pouring effort solely into a more robust model is the wrong budget allocation. The leverage is in the scaffolding around the model.

It also gives defenders a shared vocabulary. “Our agent got prompt-injected” is hard to act on; “this incident violated Least Privilege and Secure Information Flow, and our reference monitor had incomplete mediation over the ping tool” tells you exactly which control to build. The eleven-attack table turns a year of headlines into a checklist of failure modes you can test your own deployment against.

The paper closes by naming three open research problems it admits are unsolved: (1) provable separation of instructions and data (the agentic analogue of W⊕X memory protection, likely necessary but not sufficient); (2) verifiable policy generation, meaning the translation of fuzzy natural-language intent into a formal least-privilege policy that a deterministic monitor can enforce, with a correctness guarantee; and (3) information-flow control that can disentangle data sources after an LLM has mixed them in its context. None of these ships today. Treat any product marketed as “prompt-injection-proof” as overstating the state of the art.
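The third problem is easier to see with a toy example. The sketch below is a classic conservative taint scheme, not the paper's proposal, and every name in it (`Label`, `Labeled`, `label_output`, `may_flow`) is hypothetical. Because the monitor cannot tell which context items influenced the output, it labels the output with the most sensitive label present in the context.

```python
from dataclasses import dataclass
from enum import IntEnum


class Label(IntEnum):
    PUBLIC = 0
    SENSITIVE = 1


@dataclass(frozen=True)
class Labeled:
    text: str
    label: Label


def context_label(items):
    """Join (max) of the labels on everything placed in the model context."""
    return max((item.label for item in items), default=Label.PUBLIC)


def label_output(output_text, context_items):
    """Conservatively label model output with the join of its context."""
    return Labeled(output_text, context_label(context_items))


def may_flow(value, sink_clearance):
    """Allow a flow only if the sink is cleared for the value's label."""
    return value.label <= sink_clearance


context = [
    Labeled("public web page", Label.PUBLIC),
    Labeled("contents of a .env file", Label.SENSITIVE),
]
summary = label_output("The page describes a release schedule.", context)
blocked = not may_flow(summary, Label.PUBLIC)
```

Here `blocked` is true even though the summary contains nothing secret. That over-tainting is safe but coarse, and it is the gap the open problem describes: a finer-grained control would need to know which sources actually shaped the output.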

Defenses

The paper’s prescription is defense-in-depth implemented outside the model. Concretely:

  1. Treat the model as untrusted by default. Assume any input the agent ingests — documents, web pages, tool outputs, calendar events, notifications — may carry hostile instructions, and that the model may follow them. Architect so that a compromised model is contained, not catastrophic. This mirrors the “treat the agent like a process” stance.

  2. Enforce least privilege per task, not per session. The Claude Code ping case is a least-privilege failure: the agent held broad shell access at a moment the task didn’t require it. Scope tool access to the current sub-goal and revoke it once that sub-goal completes. See the rule-of-two pattern for a practical constraint.

  3. Put a deterministic reference monitor in the path of every consequential action. Aim for complete mediation: outbound network, file writes, shell execution and credential reads should all cross one checkpoint that consults an explicit policy. Avoid making an LLM the sole arbiter of that decision.

  4. Add information-flow controls on egress. Most of the eleven attacks ended at an exfiltration channel. Label sensitive sources, and block or sanitize flows from high-trust data to low-trust sinks — rendered image URLs, DNS lookups, webhooks, outbound HTTP. This is also the spirit of vendor moves like OpenAI’s data-egress lockdown.

  5. Design for the human weak link. Approval prompts that misrepresent what is being approved, or that fire so often users click through, are not a control. Reserve human review for high-blast-radius actions and make the prompt say what will actually happen.

  6. Don’t wait for the unsolved parts. Provable instruction/data separation, verifiable policy generation and full IFC are research problems. Until they land, compensate with coarse but deterministic boundaries: sandboxed execution, network allow-lists enforced below the agent, and least-privilege credentials.
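Two of the coarse boundaries above, per-task tool grants and a deny-by-default egress allow-list, can be sketched as follows. This is an illustration under stated assumptions: `ToolGate`, `egress_allowed`, the tool names and the host `api.internal.example` are all hypothetical, and in a real deployment the egress check would live in a proxy or firewall below the agent rather than in the agent's own process.

```python
from contextlib import contextmanager


class ToolGate:
    """Holds the tools the agent may call right now; empty by default."""

    def __init__(self):
        self._granted = set()

    @contextmanager
    def grant(self, *tools):
        """Grant tools for one sub-goal, then revoke them even on error."""
        added = set(tools) - self._granted
        self._granted |= added
        try:
            yield self
        finally:
            self._granted -= added

    def call(self, tool, fn, *args):
        """Run fn only if the named tool is currently granted."""
        if tool not in self._granted:
            raise PermissionError(f"tool not granted for this sub-goal: {tool}")
        return fn(*args)


ALLOWED_HOSTS = frozenset({"api.internal.example"})  # hypothetical allow-list


def egress_allowed(host):
    """Deterministic egress check: exact match, deny by default."""
    return host.lower().rstrip(".") in ALLOWED_HOSTS


gate = ToolGate()
with gate.grant("read_file"):
    content = gate.call("read_file", lambda p: f"<contents of {p}>", "README.md")

try:
    gate.call("read_file", lambda p: p, ".env")
    revoked = False
except PermissionError:
    revoked = True
```

The grant is tied to a `with` block so that revocation does not depend on the model, or the developer, remembering to ask for it.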

Status

| Item | Reference | Date | Notes |
| --- | --- | --- | --- |
| Position paper published | arXiv:2605.18991v1 [cs.CR] | 2026-05-18 | Google, UCSD, UW–Madison, Meta FAIR, Cornell, EmbraceTheRed; CC BY 4.0 |
| Real attacks analyzed | Paper §2.2 + Appendix A | 2024–2026 | 11 representative cases mapped to 5 principles; all violate Secure Information Flow |
| Open problems named | Paper §3 | 2026-05-18 | Instruction/data separation, verifiable policy generation, information-flow control; all unsolved |
| Community pickup | Adversa AI June 2026 digest | 2026-06-01 | Listed under “Article”: agent security as a systems, not model, problem |

The takeaway is not a new attack. It is a re-categorization: the agent incidents of the last year are not a model-robustness problem the labs will train away, but a systems-security problem the field already knows how to reason about — and has not yet finished engineering for agents.

Sources