SecureClaw: a dual-boundary defense for tool-using LLM agents
A June 2026 paper proposes guarding two distinct boundaries at once — authorizing external actions at the effect sink and confining plaintext at the read boundary — reporting 0% attack success on one agent benchmark.
What is this?
On 8 June 2026, Yuhan Ma and Stefan Schmid published SecureClaw: Clawing Back Control of LLM Agents (arXiv:2606.09549), a defensive architecture for tool-using LLM agents. Its starting observation is that a tool-using agent can fail in two different ways, and that most existing defenses only cover one of them.
The first failure is an unauthorized external action: the agent, steered by injected text, sends an email, makes a payment, or writes to an external system it should not touch. The second is plaintext exposure inside the runtime: a secret (an API key, a private record) is read into the model’s context, where it can leak through the final answer or be relayed to another component, before any output filter gets a chance to intervene.
This matters because, as OWASP’s June 2026 State of Agentic AI Security and Governance documents, prompt injection now maps to six of the ten categories in the Top 10 for Agentic Applications, and the incidents are no longer hypothetical. SecureClaw is one attempt to close both boundaries at the architecture level rather than the prompt level.
How it works
SecureClaw is described by its authors as a dual-boundary architecture. It places a control at each of the two failure surfaces above.
At the read boundary, sensitive reads pass through a trusted gateway. Instead of handing the raw secret to the model, the gateway substitutes an opaque handle — a reference the agent can carry around and pass to tools, but cannot itself read. In the evaluated deployment the gateway can also return a bounded summary as an explicit declassification interface: a controlled, limited view of the data rather than the full plaintext. The model plans over handles and summaries; it never dereferences the secret directly.
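The read-boundary idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `ReadGateway`, `Handle`, `read_sensitive`, `summarize`, and `resolve_for_tool` are hypothetical names, and the "bounded summary" here is assumed to be nothing more than the secret's kind and length.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Handle:
    """Opaque reference: safe to place in the planner's context."""
    token: str


class ReadGateway:
    """Hypothetical trusted gateway that keeps plaintext out of model context."""

    def __init__(self) -> None:
        # token -> (kind, plaintext); lives only inside the trusted component
        self._vault: dict[str, tuple[str, str]] = {}

    def read_sensitive(self, plaintext: str, kind: str) -> Handle:
        """Store the secret and return only a random, unrelated handle."""
        token = secrets.token_hex(16)
        self._vault[token] = (kind, plaintext)
        return Handle(token)

    def summarize(self, handle: Handle) -> str:
        """Bounded declassification (assumed form): kind and length only."""
        kind, plaintext = self._vault[handle.token]
        return f"{kind} ({len(plaintext)} chars)"

    def resolve_for_tool(self, handle: Handle) -> str:
        """Dereference inside trusted tool code; the planner never calls this."""
        return self._vault[handle.token][1]


gateway = ReadGateway()
api_key = "sk-example-not-a-real-key"
handle = gateway.read_sensitive(api_key, kind="api_key")

# Everything the planner sees: a handle token and a bounded summary.
planner_context = f"credential={handle.token} summary={gateway.summarize(handle)}"
print(planner_context)
```

Because the handle is random rather than derived from the secret, text injected into the planner's context has nothing to exfiltrate; only code on the trusted side of the gateway can turn the handle back into plaintext.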
At the effect sink, any write that changes external state follows a PREVIEW → COMMIT protocol. The agent proposes an action and sees a preview, but only a trusted executor may commit, and it commits exactly the canonical request that policy authorized — not whatever the model assembled. Side effects are gated behind that executor rather than triggered directly by the planner.
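A PREVIEW → COMMIT flow of this kind might look like the sketch below. It is an assumption-laden illustration rather than the authors' protocol: `TrustedExecutor`, `canonicalize`, the ticket scheme, and the recipient allow-list policy are all hypothetical, and the "side effect" is just an append to a list.

```python
import hashlib
import json
from dataclasses import dataclass, field


def canonicalize(action: str, params: dict) -> str:
    """Stable serialization so one request always maps to the same string."""
    return json.dumps({"action": action, "params": params},
                      sort_keys=True, separators=(",", ":"))


@dataclass
class TrustedExecutor:
    """Hypothetical non-LLM component; only it may trigger side effects."""
    allowed_recipients: frozenset
    _pending: dict = field(default_factory=dict)
    sent: list = field(default_factory=list)

    def preview(self, action: str, params: dict) -> str:
        """Check policy, freeze the canonical request, return a ticket."""
        if action != "send_email" or params.get("to") not in self.allowed_recipients:
            raise PermissionError(f"policy denies {action} to {params.get('to')!r}")
        canonical = canonicalize(action, params)
        ticket = hashlib.sha256(canonical.encode()).hexdigest()
        self._pending[ticket] = canonical
        return ticket

    def commit(self, ticket: str) -> dict:
        """Execute exactly the frozen request, once; KeyError if not previewed."""
        request = json.loads(self._pending.pop(ticket))
        self.sent.append(request)  # stand-in for the real external write
        return request


executor = TrustedExecutor(allowed_recipients=frozenset({"ops@example.com"}))
params = {"to": "ops@example.com", "body": "report ready"}
ticket = executor.preview("send_email", params)

# A compromised planner edits its own copy after the preview...
params["to"] = "attacker@example.net"

# ...but commit replays the canonical request that policy approved.
committed = executor.commit(ticket)
print(committed["params"]["to"])
```

The design choice worth noting is that `commit` takes a ticket, not parameters: the planner has no channel for changing the request between approval and execution.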
             untrusted planner (LLM)
                        |
       plans over handles + summaries only
                        |
read boundary  ┌────────┴────────┐  effect sink
trusted        │ secret → handle │  PREVIEW → COMMIT
gateway        │ + bounded       │  trusted executor
               │ summary         │  commits canonical
               └─────────────────┘  authorized request
The design echoes the principle behind the broader Design Patterns for Securing LLM Agents work from June 2025: you don’t make the model immune to malicious text, you constrain what a compromised model is able to do. SecureClaw’s contribution is doing this on both the data-read and the action-write side in one harness.
No exploit payloads are reproduced here; the value for defenders is the boundary placement, not any specific injection string.
Why it matters
The paper reports results across three public agent-security benchmarks: AgentDojo, AgentLeak, and Agent Security Bench (ASB). In the authors’ common evaluation harness, SecureClaw reaches 0% attack success rate on ASB, 0.64% on AgentDojo, and a 3.23% overall leak on AgentLeak’s attacked parity lane, while still retaining usable task performance. These are the authors’ own numbers on their own setup, so read them as a promising single-paper result rather than an independently reproduced guarantee. The takeaway is the direction: low attack success and low leakage without giving up task utility.
The reason this framing is useful maps directly onto the threat data. Simon Willison’s “lethal trifecta” (private data + untrusted content + outbound communication) and Meta’s “Agents Rule of Two” both say the same thing: an agent that can read secrets and act externally and be steered by untrusted text is exploitable by a single injected prompt. SecureClaw attacks two legs of that trifecta structurally — the secret never reaches the planner in usable form, and the external action never fires without a trusted commit. That is why guarding only one boundary, as many output-filter or planner-hardening defenses do, leaves the other surface open.
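The "one boundary is not enough" argument can be made concrete with a toy model. The sketch below is a simplification of my own, not something taken from the paper, Willison, or Meta: `AgentProfile` and `open_surfaces` are hypothetical names, and the rules assume a surface is only reachable when the agent also processes untrusted content.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AgentProfile:
    reads_private_data: bool
    processes_untrusted_content: bool
    acts_externally: bool
    read_boundary_control: bool = False  # e.g. handles instead of plaintext
    effect_sink_control: bool = False    # e.g. trusted preview/commit executor


def open_surfaces(agent: AgentProfile) -> set[str]:
    """Return the failure surfaces left uncovered under this toy model."""
    surfaces: set[str] = set()
    if not agent.processes_untrusted_content:
        return surfaces  # assumed: nothing can steer the planner
    if agent.reads_private_data and not agent.read_boundary_control:
        surfaces.add("plaintext_exposure")
    if agent.acts_externally and not agent.effect_sink_control:
        surfaces.add("unauthorized_action")
    return surfaces


unguarded = AgentProfile(True, True, True)
sink_only = replace(unguarded, effect_sink_control=True)
dual = replace(sink_only, read_boundary_control=True)

for name, agent in [("unguarded", unguarded), ("sink_only", sink_only), ("dual", dual)]:
    print(name, sorted(open_surfaces(agent)))
```

Gating only the effect sink still leaves `plaintext_exposure` open in this model, which mirrors the paragraph's point about single-boundary defenses.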
Defenses
If you build or operate tool-using agents, the actionable takeaways do not require adopting SecureClaw itself:
- Separate “read a secret” from “use a secret.” Hand the agent an opaque handle or a bounded summary, not raw plaintext. The model should be able to reference a credential or record without ever having it in context, so an injection cannot exfiltrate what was never readable.
- Gate every state-changing action behind a trusted executor. Adopt a preview-then-commit flow where a non-LLM component commits only the exact, policy-approved request. Never let the planner’s free-form output be the thing that fires the side effect.
- Defend both boundaries, not one. An output filter alone does not stop in-runtime relay leakage; planner hardening alone does not stop unauthorized writes. Map your agent against both failure modes and confirm each has a control.
- Apply the Rule of Two as a budget. Let an agent hold at most two of three properties in a session: private-data access, untrusted input, and external communication. Where a workflow would combine all three without a human in the loop, treat that as the case requiring the strongest structural containment, or require human approval.
- Validate with adversarial benchmarks. Test candidate defenses against suites like AgentDojo and ASB and watch both attack-success and leakage metrics, not just task utility, before trusting a configuration in production.
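The last takeaway can be turned into a simple release gate. The sketch below assumes you already have raw counts from a benchmark run; `EvalRun`, `passes_gate`, and the thresholds are hypothetical, and the counts are made-up illustrative values, not figures from the paper or from any benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    attacks_attempted: int
    attacks_succeeded: int
    leak_trials: int
    leaks_observed: int
    tasks_attempted: int
    tasks_completed: int


def rate(numerator: int, denominator: int) -> float:
    """Plain ratio; refuses empty runs instead of reporting a misleading 0."""
    if denominator <= 0:
        raise ValueError("no trials recorded")
    return numerator / denominator


def passes_gate(run: EvalRun, max_asr: float, max_leak: float,
                min_utility: float) -> bool:
    """Require all three metrics to clear their thresholds, not just utility."""
    asr = rate(run.attacks_succeeded, run.attacks_attempted)
    leak = rate(run.leaks_observed, run.leak_trials)
    utility = rate(run.tasks_completed, run.tasks_attempted)
    return asr <= max_asr and leak <= max_leak and utility >= min_utility


# Illustrative counts only.
hardened = EvalRun(500, 3, 400, 10, 200, 170)  # ASR 0.6%, leak 2.5%, utility 85%
leaky = EvalRun(500, 3, 400, 80, 200, 190)     # ASR 0.6%, leak 20%, utility 95%

print(passes_gate(hardened, max_asr=0.01, max_leak=0.03, min_utility=0.80))
print(passes_gate(leaky, max_asr=0.01, max_leak=0.03, min_utility=0.80))
```

The second configuration has the better task utility and the same attack-success rate, yet it fails the gate on leakage, which is the kind of regression a utility-only check would miss.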
Status
| Item | Reference | Date | Notes |
|---|---|---|---|
| SecureClaw paper | arXiv:2606.09549 | 2026-06-08 | Dual-boundary architecture; v1 |
| Reported results | arXiv:2606.09549 | 2026-06-08 | 0% ASR (ASB), 0.64% ASR (AgentDojo), 3.23% leak (AgentLeak) — authors’ harness |
| Design Patterns for Securing LLM Agents | arXiv:2506.08837 | 2025-06-10 | Related “constrain the agent” defensive framing |
| OWASP State of Agentic AI Security and Governance | Help Net Security | 2026-06-11 | Prompt injection maps to 6 of 10 agentic categories |
The honest framing is not “prompt injection is solved.” It is that the most credible defenses are moving from “make the model refuse bad instructions” to “make sure a compromised model cannot read what it shouldn’t or do what it shouldn’t.” SecureClaw is a fresh, concrete example of that shift — worth reading if you own an agent that touches both secrets and the outside world.