Why prompt-injection detectors keep failing: the evasion problem in 2026
From keyword classifiers to activation-based drift probes, prompt-injection detectors share one weakness: an adaptive attacker. Two studies report up to ~100% evasion. Treat detection as one layer, never the boundary.
In brief: Prompt-injection “detectors”, the input-side guardrails that flag a malicious instruction before it reaches your model, are widely deployed and easy to oversell. Two studies, Hackett et al. (LLMSec 2025) and Jahed & Alouani (arXiv 2602.00750, Feb 2026), show that both the older classifier-style detectors and the newer activation-based ones fall to an adaptive attacker, sometimes with near-100% evasion. The lesson is architectural: a detector is a useful filter, not a security boundary.
What is this?
Prompt injection is OWASP’s number-one risk for LLM applications, and the most common first line of defense is a detector: a separate model or ruleset that inspects untrusted input and blocks anything that looks like an injected instruction. Microsoft’s Azure Prompt Shield and Meta’s Prompt Guard are well-known examples.
The uncomfortable result, documented across 2025–2026, is that these detectors are systematically evadable. Hackett, Birch, Trawicki, Suri and Garraghan tested six prominent protection systems and reported evasion rates up to 100% (arXiv 2504.11168, v1 April 2025, accepted to LLMSec 2025). In February 2026, JR Jahed and Ihsen Alouani turned the same lens on the newer generation — activation-based “task-drift” probes that read a model’s hidden layers — and broke them too (arXiv 2602.00750). The detection idea keeps changing; the evasion result keeps repeating.
How it works
There are two distinct evasion families, matched to two detector designs. No working payloads are reproduced here — the point is the class, not a copy-paste exploit.
Against text/classifier detectors — character injection. The injected instruction is rewritten so a human (and the model) still reads it, but the detector’s tokenizer or keyword logic does not. Published techniques include zero-width and Unicode Tag characters, homoglyph substitution, full-width (CJK) characters, leetspeak, character spacing, and base64 wrapping:
Detector sees: garbled / encoded / spaced-out tokens -> "benign"
Model sees: the same string, normalised internally -> follows instruction
Hackett et al. combine this with an adversarial-ML step: a word-importance ranking computed on an offline white-box model is used to make black-box bypasses more reliable.
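A toy illustration of the mismatch, using a deliberately neutral placeholder phrase rather than a real payload. The `naive_detector` function and its one-entry blocklist are hypothetical stand-ins for a keyword-style filter, not a model of any system Hackett et al. tested. The sketch only shows why a plain substring match stops firing once invisible or full-width code points are mixed in.

```python
ZERO_WIDTH_SPACE = "\u200b"

# Hypothetical one-entry blocklist; the phrase is a neutral placeholder.
BLOCKLIST = ("example flagged phrase",)


def naive_detector(text: str) -> bool:
    """Flag the input if any blocklisted phrase appears as a plain substring."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in BLOCKLIST)


def interleave_zero_width(text: str) -> str:
    """Insert a zero-width space between every pair of characters."""
    return ZERO_WIDTH_SPACE.join(text)


def to_full_width(text: str) -> str:
    """Map printable ASCII (except space) to its full-width counterpart."""
    return "".join(
        chr(ord(ch) + 0xFEE0) if "!" <= ch <= "~" else ch for ch in text
    )


plain = "This contains an example flagged phrase."
for variant in (plain, interleave_zero_width(plain), to_full_width(plain)):
    print(naive_detector(variant))  # True, then False, then False
```

Both variants look the same or nearly the same to a reader, yet neither contains the blocklisted byte sequence, so the substring check has nothing to match.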
Against activation-based detectors — adversarial suffixes. Drift probes don’t read characters; they read the shift a malicious instruction causes in the model’s internal activations. Jahed & Alouani append an adversarially optimised suffix to the poisoned input, using a modified Greedy Coordinate Gradient (GCG) search to find a universal suffix that fools every layer-wise probe at once while keeping the injection effective. Reported attack success at evading all detectors simultaneously: 93.91% on Phi-3 3.8B and 99.63% on Llama-3 8B.
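As a rough mental model only, the sketch below treats each layer's probe as a logistic classifier over the activation shift between the clean task and the task plus untrusted text. The linear form, the 0.5 threshold, and every name here are assumptions for illustration; the text above does not specify the probes' architecture. It shows what "fooling every layer-wise probe at once" means operationally: the input passes only if no layer fires.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def drift(before: Vector, after: Vector) -> list[float]:
    """Element-wise activation shift caused by adding the untrusted text."""
    return [a - b for a, b in zip(after, before)]


def probe_score(weights: Vector, bias: float, delta: Vector) -> float:
    """Logistic score in (0, 1); higher means more likely task drift."""
    z = bias + sum(w * d for w, d in zip(weights, delta))
    return 1.0 / (1.0 + math.exp(-z))


def layers_that_fire(probes, before_by_layer, after_by_layer, threshold=0.5):
    """Indices of layers whose probe flags the input at the given threshold."""
    fired = []
    for layer, (weights, bias) in enumerate(probes):
        delta = drift(before_by_layer[layer], after_by_layer[layer])
        if probe_score(weights, bias, delta) >= threshold:
            fired.append(layer)
    return fired


# Two hypothetical layers with hand-picked weights (not trained values).
probes = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0)]
clean = [[0.0, 0.0], [0.0, 0.0]]
shifted = [[2.0, 0.0], [0.0, -2.0]]
print(layers_that_fire(probes, clean, shifted))  # [0]
```

An evasion in this framing is any input whose injected instruction still takes effect while `layers_that_fire` returns an empty list.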
The common thread is the adaptive adversary. A detector trained or tuned on yesterday’s attack strings is, by construction, a fixed target — and an attacker who can query or model it can search around it.
Why it matters
If your architecture assumes “the guardrail caught it,” an evasion is silent: the injected instruction reaches the model, and everything downstream — tool calls, data reads, outbound requests — proceeds as if the input were trusted. The blast radius is whatever the agent is allowed to do.
This is not a reason to drop detectors; a good filter still stops the lazy majority of attacks and provides signal for monitoring. It is a reason to stop treating them as a control that fails closed: a missed detection fails open, and nothing downstream knows it happened. Vendors quote F1 scores on static benchmarks; the studies above measure something different, robustness to an attacker who adapts, and that number is far lower.
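To make the blast-radius point concrete, here is a minimal sketch of a gate that bounds what a tool call can do regardless of what any detector said about the input. `AgentContext`, `authorise`, and the tool and host names are all hypothetical. The policy shown (a tool allowlist, an egress host allowlist, and a refusal to send outbound requests once the same context has both read private data and ingested untrusted content) is one possible design, not a prescribed one.

```python
from dataclasses import dataclass, field
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class AgentContext:
    allowed_tools: frozenset
    allowed_hosts: frozenset
    read_private_data: bool = False
    saw_untrusted_content: bool = False
    audit_log: list = field(default_factory=list)


def authorise(ctx: AgentContext, tool: str, url: Optional[str] = None) -> bool:
    """Allow or deny one tool call; the decision ignores any detector verdict."""
    reason = None
    if tool not in ctx.allowed_tools:
        reason = "tool not allowlisted"
    elif url is not None:
        host = urlsplit(url).hostname
        if host not in ctx.allowed_hosts:
            reason = "egress host not allowlisted"
        elif ctx.read_private_data and ctx.saw_untrusted_content:
            reason = "private data + untrusted content + egress in one context"
    # Every decision is logged so anomalous sequences can be reviewed later.
    ctx.audit_log.append((tool, url, reason or "allowed"))
    return reason is None


ctx = AgentContext(
    allowed_tools=frozenset({"search", "http_get"}),
    allowed_hosts=frozenset({"api.example.com"}),
)
print(authorise(ctx, "http_get", "https://api.example.com/v1"))  # True
print(authorise(ctx, "shell"))                                   # False
```

Because the gate never consults the detector, an evaded detection changes what the model is told to do but not what the agent is permitted to do.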
Defenses
Detection is a layer. Build as if it will be bypassed.
- Normalise before you inspect. Most character-injection bypasses die under canonicalisation: Unicode NFKC normalisation, stripping zero-width and Unicode Tag code points, homoglyph folding, whitespace collapsing, and decoding obvious base64. A March 2026 practitioner write-up on the open-source ClawGuard scanner shows deterministic preprocessing (including full-width normalisation) catching a large share of these variants at sub-10 ms latency. Normalisation belongs in front of any detector.
- Don’t make the detector the boundary. The controls that survive a missed detection are architectural: least privilege on every tool, a hard separation between trusted instructions and untrusted data, and avoiding the “lethal trifecta” of private-data access + untrusted content + an exfiltration channel in the same agent context. Put a human in the loop on consequential actions.
- Layer detectors so an evasion must beat all of them. A deterministic regex/normalisation stage, a classifier, and an activation probe fail against different attacks. Require a bypass to defeat the whole stack at once, and fix a single operating point so you measure recall honestly rather than cherry-picking thresholds.
- Adversarially train what you keep. Jahed & Alouani propose suffix augmentation: generate multiple adversarial suffixes, append one at random during training, and train the probe on the resulting activations. It raises the cost of evasion; it does not close the gap.
- Assume failure and watch the outputs. Constrain egress, log tool calls, and alert on anomalous action sequences. The injection you cannot detect at the input you may still catch at the action.
- Red-team your own guardrail. Run these published evasion classes against your stack with tools like garak or promptfoo, at a fixed threshold, before an attacker does.
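The first and third items above can be sketched together: canonicalise the input, then run every detector layer over every recovered view at a fixed threshold. This is a minimal sketch under stated assumptions, not ClawGuard's implementation. All names are hypothetical, the invisible-character set is illustrative rather than exhaustive, and the `score` callables are placeholders for real detectors. Homoglyph folding is left out because NFKC does not fold cross-script look-alikes; that step needs a confusables mapping such as the one published with Unicode UTS #39.

```python
import base64
import binascii
import re
import unicodedata
from dataclasses import dataclass
from typing import Callable

# Invisible code points to strip (illustrative, not exhaustive).
_INVISIBLE = {0x00AD, 0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF}
_B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def canonicalise(text: str) -> str:
    """NFKC-fold, drop invisible and Unicode Tag code points, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    kept = (
        ch for ch in text
        if ord(ch) not in _INVISIBLE and not 0xE0000 <= ord(ch) <= 0xE007F
    )
    return re.sub(r"\s+", " ", "".join(kept)).strip()


def decoded_base64_spans(text: str) -> list[str]:
    """Decode runs that are valid base64 and valid UTF-8; skip everything else."""
    decoded = []
    for match in _B64_RUN.finditer(text):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
            decoded.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue
    return decoded


def views(text: str) -> list[str]:
    """Every form of the input that the detectors should inspect."""
    clean = canonicalise(text)
    return [clean] + [canonicalise(s) for s in decoded_base64_spans(clean)]


@dataclass(frozen=True)
class Layer:
    name: str
    score: Callable[[str], float]  # higher means more suspicious
    threshold: float               # fixed operating point, set once


def flagging_layers(text: str, layers: list[Layer]) -> list[str]:
    """Names of layers that flag any view; block if the list is non-empty."""
    return [
        layer.name
        for layer in layers
        if any(layer.score(v) >= layer.threshold for v in views(text))
    ]
```

A caller would build the `layers` list once, with thresholds chosen on held-out data, and treat any non-empty result from `flagging_layers` as a block. Keeping thresholds inside frozen `Layer` objects is one way to make the operating point hard to adjust casually between evaluation runs.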
Status
| Item | Reference | Date | Note |
|---|---|---|---|
| Character-injection + AML evasion, 6 systems | arXiv 2504.11168 | Apr 2025 (LLMSec 2025) | Up to 100% evasion; Azure Prompt Shield, Meta Prompt Guard among targets |
| Adversarial-suffix evasion of activation probes | arXiv 2602.00750 | Feb 2026 | 93.91% (Phi-3 3.8B), 99.63% (Llama-3 8B) at evading all probes |
| Deterministic normalisation defense (ClawGuard) | earezki.com | Mar 2026 | Regex/NFKC preprocessing, sub-10 ms; references 2602.00750 |
| Prompt injection = LLM01 | OWASP LLM Top 10 | 2026 | Still the #1 application risk |
The framing to keep: a prompt-injection detector tells you about the attacks an adversary didn’t bother to hide. Treat a clean result as the absence of a lazy attack, not the presence of safety — and put your real controls where the model acts, not where it reads.