SafeHarbor: a hierarchical memory guardrail that targets agent over-refusal
Accepted at ICML 2026, SafeHarbor is a training-free guardrail that injects context-aware safety rules from a self-evolving risk tree — keeping 63.6% benign utility on GPT-4o while refusing over 93% of attacks.
What is this?
On May 7, 2026, Zhe Liu, Zonghao Ying, Wenxin Zhang, Quanchen Zou, Deyue Zhang, Dongdong Yang, Xiangzheng Zhang and Hao Peng published SafeHarbor: Defining Precise Decision Boundaries via Hierarchical Memory-Augmented Guardrail for LLM Agent Safety (arXiv:2605.05704, cs.CR / cs.AI). The paper was accepted at ICML 2026, and the authors shipped working code, pre-built artefacts and an evaluation harness on GitHub.
This is a defense paper, and its starting point is a problem every team that has bolted a safety filter onto a tool-using agent will recognise: the over-refusal tax. Make a guardrail strict enough to block tool-misuse attacks and it starts refusing legitimate work too — the agent becomes safer and less useful at the same time. SafeHarbor’s claim is that you can move both numbers in the right direction at once, without retraining the model.
How it works
SafeHarbor sits in front of the target model as a drop-in, OpenAI-compatible proxy. It is training-free with respect to the underlying LLM: nothing about GPT-4o or whatever model you point it at is fine-tuned. Two components do the work.
The first is a hierarchical Risk Tree — a memory of past attack patterns, clustered into nodes, where each node carries a generated defense_strategy and a benign_boundary_rule. The tree is built offline in two phases. A red-team stage mutates harmful samples through four strategies — benign decomposition, argument injection, scenario disguise, and format-shift — keeping only mutations whose malicious intent survives a verifier-LLM check. A defender stage then generates a per-cluster defense strategy and calibrates each rule against near-identical benign requests, so the rule learns where the line between “block” and “allow” actually sits. An information-entropy signal lets the tree self-evolve by splitting and merging nodes as it grows.
The second is a Safety Projector: a small two-layer MLP that maps the 384-dimensional sentence embedding into a 128-dimensional “safety-aware” space, plus a binary head. Trained with a triplet-plus-BCE loss, its job is to decouple safety-relevant directions from semantic ones in the embedding space — so retrieval matches on “is this dangerous?” rather than “what topic is this about?”, which is exactly the confusion that makes naive embedding filters over-refuse.
At inference, the proxy projects the incoming request, retrieves the most relevant risk evidence from the tree, and injects it as a leading safety context before forwarding the call upstream.
# Conceptual flow — illustrative, drawn from the public SafeHarbor repo.
request --> Safety Projector (384d -> 128d safety space)
--> retrieve top-k nodes from Risk Tree
--> inject {defense_strategy, benign_boundary_rule} as safety context
--> forward to target LLM (no fine-tuning)
Why it matters
The reported numbers are the point. On GPT-4o, SafeHarbor holds a peak benign utility of 63.6% while keeping a refusal rate above 93% on explicit malicious requests, evaluated on AgentHarm and Agent-SafetyBench against RAG, A-Mem, GuardAgent and Llama Guard baselines. Whether those exact figures hold on your workload is unknown — they are single-paper results on two benchmarks, with GPT-4o as the headline model — but the framing is the useful part: a guardrail should be measured on both axes, and “refusal rate” alone is a misleading score.
It also fits a broader 2026 pattern. SafeHarbor is one of several self-evolving, memory-based guardrails appearing this year — alongside Membrane’s contrastive safety memory — that treat the boundary between safe and unsafe as something learned and continuously recalibrated, rather than a static block-list. For builders, that signals a shift from “write better refusal prompts” to “maintain a living memory of attack and benign patterns.”
Defenses
SafeHarbor is itself a defensive control, so the practical question is how to adopt the idea soundly.
Treat any memory-driven guardrail as a layer, not the layer. Because rules are retrieved by similarity, a request that no tree node resembles falls back to the base model’s own judgement — so keep deterministic controls underneath: least-privilege tool scopes, sandboxed execution, and human review where the blast radius is large. SafeHarbor’s own design stacks cleanly on top of prompt-level filters like Llama Guard.
Audit the rules, not just the verdicts. The repo ships a human-readable dump of every cluster’s defense strategy and benign boundary rule. A memory built from auto-generated attacks can encode a biased or stale view of “safe”; review it the way you would review a firewall ruleset, and watch the benign-boundary rules for over-blocking.
Measure both axes before and after deployment. The single most transferable lesson here is methodological: report benign utility and attack refusal together, on a benchmark that includes ambiguous-but-legitimate tasks, or you will ship a guardrail that looks safe and quietly breaks real work.
Finally, mind the retrieval surface itself. A guardrail whose memory grows from ingested attack data inherits the poisoning concerns of any retrieval system — control what gets written into the tree, and keep the build pipeline that evolves it as trusted as the model it protects.
Status
| Item | Reference | Date | Notes |
|---|---|---|---|
| Paper | arXiv:2605.05704 | 2026-05-07 | Accepted at ICML 2026 |
| Code + artefacts | github.com/ljj-cyber/SafeHarbor | 2026 | MIT license; pre-built Risk Tree + Safety Projector shipped |
| Headline result | Benign utility 63.6% / refusal >93% | — | GPT-4o, on AgentHarm + Agent-SafetyBench |
| Baselines compared | RAG, A-Mem, GuardAgent, Llama Guard | — | Reproduction scripts included |
SafeHarbor will not end prompt injection or tool misuse — no single guardrail does. Its contribution is narrower and useful: a concrete, reproducible way to chase safety without paying the full over-refusal tax, and a reminder that any honest guardrail evaluation has to report what it breaks as well as what it blocks.