ChatInject: forging chat-template role tags to bypass the instruction hierarchy
An ICLR 2026 paper shows that wrapping an indirect-injection payload in a model's own chat-template tokens forges a higher-priority role, lifting attack success from 5% to 32% on AgentDojo and to 52% with multi-turn.
What is this?
ChatInject is an indirect prompt injection technique that abuses the chat template — the special role tokens (<|system|>, <|user|>, <|assistant|>) that models use to segment a prompt into roles. Instead of hiding a plain-text instruction in a tool output, the attacker embeds the model’s own role tags inside that output, forging a higher-priority role and tricking the agent into treating attacker-controlled data as an authoritative instruction.
The method comes from “ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents” by Hwan Chang, Yonghyun Jun and Hwanhee Lee (Chung-Ang University), published as a conference paper at ICLR 2026. The current revision (arXiv:2509.22830v3) is dated 13 April 2026, with code on the project page and GitHub. It is directly relevant to anyone deploying tool-using agents, because it targets the very mechanism — the instruction hierarchy — that vendors built to stop indirect injection.
How it works
Modern agents are trained to enforce a role-based priority order: system > user > assistant > tool output. The order is implemented with special tokens that mark where each role begins and ends. The paper’s key observation is that this segmentation is only as trustworthy as the channel it travels on — and tool output is the least trustworthy channel of all.
When an attacker who controls a tool’s return value (a web page, an email, a file, an API response) writes the model’s role tokens into that value, the model can re-read the forged tags and promote the wrapped content to a higher-priority role. A vanilla “ignore previous instructions” line that the agent would normally discard becomes, once wrapped in forged system/user tags, something the model reads as coming from its operator.
Tool output (attacker-controlled), schematically:
<forged_system_tag> [authoritative-sounding preamble]
<forged_user_tag> [ATTACKER INSTRUCTION — REDACTED]
The agent parses the forged tags, mis-attributes the wrapped text to a
privileged role, and acts on it instead of filtering it as data.
The paper defines four payload variants — plain text, plain text + ChatInject wrapping, a plain multi-turn dialogue, and multi-turn + ChatInject — plus agentic hooks that mimic a model’s <think> reasoning and <tool> call tokens. The strongest variant fabricates an entire multi-turn conversation inside a single tool output: a synthetic 7-turn user/assistant dialogue (generated with GPT-4.1) that gradually “justifies” the malicious action, so the agent encounters what looks like a prior, consented exchange. No real interaction is needed; the persuasion is forged wholesale. This is the one-shot equivalent of a multi-turn jailbreak, smuggled through a single injection point.
Measured on two standard benchmarks (AgentDojo and InjecAgent) across nine frontier models, the wrapping lifts average attack success rate (ASR) from 5.18% to 32.05% on AgentDojo and from 15.13% to 45.90% on InjecAgent; the multi-turn + ChatInject variant reaches 52.33% on InjecAgent. The payloads also transfer across models — a wrapper built from one model’s template often works on another, including closed-source models (GPT-4o, Grok-3, Gemini-2.5-Pro) whose templates are proprietary, because many of them are structurally close to popular open-source templates. A “mixture-of-templates” payload that bundles several role-tag conventions at once raises the odds of matching an unknown backbone. Models with weak role delimiters (Grok-2) were notably more resistant.
Why it matters
ChatInject does not invent a new capability so much as expose a load-bearing assumption. The instruction hierarchy is the headline defense against indirect injection, and it is enforced in band — with tokens that live in the same byte stream as untrusted data. Any boundary that an attacker can also write is not a boundary.
Three consequences stand out. First, transferability means an attacker does not need to know your model: a generic wrapper or a template mixture is enough to get meaningful ASR against closed-source endpoints. Second, the multi-turn variant turns prompt injection — historically a one-shot problem — into a persuasion problem, importing the effectiveness of conversational jailbreaks into the tool-output channel. Third, and most important for defenders: the paper shows that the prompt-based defenses many teams rely on do not hold, and that the obvious fix (stripping role tags) is defeated by trivial perturbations.
Defenses
The mandatory takeaway is that no single in-context control is sufficient. Layer the following.
-
Strip and normalize role tokens from tool output — but don’t stop there. Remove or escape special role/template tokens in any data the agent ingests, at the parser/tokenizer boundary, and normalize whitespace and Unicode so character-level edits can’t smuggle tags past a literal filter. The paper shows that perturbing ~10% of characters defeats naive rule-based stripping, so treat this as necessary, not sufficient.
-
Separate data from instructions architecturally. The root cause is that tool content shares a channel with privileged instructions. Favor designs that keep retrieved/tool data in a structurally distinct, non-elevatable position rather than relying on in-prompt tags the model is trained to trust. Models and harnesses that never promote tool content to a higher role are the durable fix.
-
Gate the action, not just the text — least privilege. ChatInject only matters if a compromised agent can do something harmful. Require explicit policy checks or human confirmation before high-impact tool calls (sending money, posting, deleting, exfiltrating), and scope tool credentials tightly. This is the same logic as the lethal trifecta and agents rule of two: if untrusted input, sensitive capability, and exfiltration can’t co-occur, a high ASR on the text layer doesn’t become a breach.
-
Don’t rely on prompt-based mitigations alone. Sandwich/user-instruction repetition, instructional-prevention prompts, and naive delimiting were largely ineffective in the paper — and sometimes raised ASR. Keep them as hygiene, never as your primary control.
-
Use detectors as defense-in-depth, with care. External detectors (a DeBERTa PI detector, Lakera Guard) reduced ASR but reacted mostly to the presence of special tokens, missed contextual multi-turn persuasion, and caused high false-positive rates; their coarse “drop the whole tool output” failure mode stalls the agent. Tune for redaction over wholesale removal and pair with the controls above.
-
Re-test your own stack. The benchmarks (AgentDojo, InjecAgent) and the attack code are public. Run ChatInject-style payloads against your model and template family before an attacker does — and don’t assume a closed-source model is immune, since template proximity to open-source families drove much of the cross-model transfer.
Status
| Item | Reference | Date | Notes |
|---|---|---|---|
| ChatInject paper (v3) | arXiv:2509.22830 [cs.CL] | 2026-04-13 | Published at ICLR 2026 |
| Benchmarks used | AgentDojo, InjecAgent | — | ASR + utility-under-attack metrics |
| Models evaluated | Qwen3, GPT-oss-120b, Llama-4, GLM-4.5, Kimi-K2, Grok-2 (open); GPT-4o, Grok-3, Gemini-2.5-Pro (closed) | — | Grok-2 most resistant (weak role delimiters) |
| Headline result | AgentDojo 5.18%→32.05% ASR; InjecAgent 15.13%→45.90%; multi-turn 52.33% | — | Averages across models |
| Defenses tested | PI detector, Lakera Guard, instructional prevention, data delimiters, user-instruction repetition | — | Prompt-based defenses largely ineffective; tag-stripping bypassed by perturbation |
| Code | project page / GitHub | — | Public, for evaluation |
The right framing is not “another jailbreak number.” It is that the delimiter the industry chose to mark trust can itself be written by the attacker, and that defending tool-using agents requires moving the trust boundary out of the prompt and onto the action.