PROMPT INJECTION MEDIUM NEW

Automated prompt injection is model-dependent: TAP beats GCG, GPT-5 resists

A June 9, 2026 ETH Zurich study adapts GCG and TAP to AgentDojo across 80 agent task pairs. Black-box TAP beats gradient-based GCG, yet attacks tuned on small models fail to transfer to GPT-5.

2026-06-25 // 6 min affects: qwen3-4b, gemma3-4b, gpt-5, gpt-5-mini, claude-sonnet-4.5, gemini-2.5-flash, qwen3-235b

What is this?

On June 9, 2026, three ETH Zurich researchers (David Hofer, Edoardo Debenedetti and Florian Tramèr) published Assessing Automated Prompt Injection Attacks in Agentic Environments (arXiv:2606.10525). It is the first systematic measurement of whether the automated attack methods that work for jailbreaking also work for indirect prompt injection (IPI) against tool-calling agents. The short answer: they work, but unevenly. Against small open-weight models the success rates are substantial (up to about 45% on Qwen3-4B); against a frontier model (GPT-5) they collapse, and attacks optimized on small models do not transfer up. Automated injection is a credible threat, but a strongly model-dependent one.

How it works

The team adapted two known jailbreak optimizers to the agentic setting inside AgentDojo, the standard benchmark for agents acting on untrusted data. The white-box method is GCG (Greedy Coordinate Gradient), which uses gradients to search for an adversarial token string; the black-box method is TAP (Tree of Attacks with Pruning), which uses an attacker LLM to iteratively rewrite an injection and prune dead ends. No payloads are reproduced here: the contribution is the measurement, not a recipe.

The evaluation spans 80 task pairs across four domains (workspace, banking, travel, slack). The headline numbers, on the small Qwen3-4B target:

Method (Qwen3-4B target)      Attack Success Rate
----------------------------  -------------------
Universal TAP (black-box)     45.2%
Single-task TAP               44.6%
Universal GCG (white-box)     24.1%
Single-task GCG               23.0%

Two structural findings stand out. First, black-box beats white-box: TAP roughly doubles GCG’s success, which the authors attribute to GCG’s optimization instability under a realistic compute budget. Second, the attack’s strength depends on the attacker model — a stronger, less safety-tuned attacker LLM produces better injections, while a safety-tuned attacker sometimes refuses to generate them at all.
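The "roughly doubles" claim can be checked directly against the table above. The short sketch below only divides the published success rates; the dictionary keys and the helper name are illustrative choices, not anything from the paper.

```python
# Attack success rates (percent) on the Qwen3-4B target, copied from the table above.
asr = {
    "universal_tap": 45.2,
    "single_task_tap": 44.6,
    "universal_gcg": 24.1,
    "single_task_gcg": 23.0,
}


def ratio(numerator: str, denominator: str) -> float:
    """Return how many times larger one method's ASR is than another's."""
    return asr[numerator] / asr[denominator]


universal_ratio = ratio("universal_tap", "universal_gcg")
single_ratio = ratio("single_task_tap", "single_task_gcg")
print(f"universal: {universal_ratio:.2f}x, single-task: {single_ratio:.2f}x")
```

This prints "universal: 1.88x, single-task: 1.94x", so TAP lands a little under twice GCG's success rate in both settings.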

Why it matters

The interesting result is the ceiling, not the floor. On GPT-5, the best attacks reach an attack success rate (ASR) of only about 4.5–4.7%, and GCG strings transferred from Qwen3-4B land below 1%. Universal injections that generalize to held-out task domains on the small model drop to 0% on GPT-5’s held-out domain. In other words, the cheap path (optimize an injection against an open model you control, then fire it at a frontier deployment) largely does not work today.

That is good news with an expiry date. It says model-agnostic, push-button injection is not here yet; it does not say agents are safe. Slack-style tasks were the most vulnerable surface (around 67% ASR on the small model), and even a plain instruction with no optimization scored ~25% there. Anyone running open-weight or smaller models in an agent loop over untrusted content is squarely in the exploitable range the paper measures.
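Teams that want to know where their own deployment sits in that range need a per-domain ASR, meaning the fraction of injection attempts in which the agent carried out the attacker's goal. The sketch below is one minimal way to tally it, assuming a hypothetical log of (suite, succeeded) pairs. The function name and the records are placeholders for illustration and are not data from the paper.

```python
from collections import defaultdict


def asr_by_suite(records):
    """Compute attack success rate per suite from (suite, succeeded) pairs."""
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for suite, succeeded in records:
        attempts[suite] += 1
        if succeeded:
            successes[suite] += 1
    return {suite: successes[suite] / attempts[suite] for suite in attempts}


# Placeholder outcomes for illustration only; NOT results from the paper.
toy_records = [
    ("slack", True), ("slack", True), ("slack", True), ("slack", False),
    ("banking", True), ("banking", False), ("banking", False),
    ("banking", False), ("banking", False),
]

rates = asr_by_suite(toy_records)
# List the most exposed suite first.
for suite, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{suite}: {rate:.0%}")
```

Breaking the rate out by suite rather than reporting one aggregate figure keeps a highly exposed surface from being averaged away by more robust ones.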

Defenses

The paper’s own finding — frontier robustness plus poor cross-model transfer — is a reason to be deliberate about model choice for agents that read untrusted data, not a reason to relax. The durable mitigations are architectural and predate this work:

Treat tool output as data, never as instructions. Keep retrieved content out of the privileged instruction channel; AgentDojo exists precisely to test defenses built on this separation.
Authorize the action, not the text. Gate every consequential tool call (send, pay, share, delete) on the user’s original intent, with human confirmation for irreversible operations.
Constrain the blast radius. Least-privilege tool scopes, allow-listed recipients and per-session spend/scope limits turn a successful injection into a contained one.
Watch the high-risk surfaces first. Slack-style messaging tasks showed the highest susceptibility in the study; prioritize monitoring and guardrails on messaging and similar communication tools.
Re-test under optimization, not just static prompts. A defense that survives a hand-written injection can still fall to an adaptive, attacker-LLM-driven one; evaluate with automated red-teaming.
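
The "authorize the action" and "constrain the blast radius" items above can be made concrete with a small gate in front of the tool executor. This is a minimal sketch, not a method from the paper: the tool names, SessionPolicy and authorize are all hypothetical, and a real deployment would derive the policy from the user's original request.

```python
from dataclasses import dataclass

# Hypothetical tool names; adapt to the tools your agent actually exposes.
READ_ONLY_TOOLS = {"read_inbox", "search_files"}
IRREVERSIBLE_TOOLS = {"send_money", "delete_file"}


@dataclass
class SessionPolicy:
    allowed_tools: set        # consequential tools implied by the user's request
    allowed_recipients: set   # allow-listed recipients for this session
    spend_limit: float = 0.0  # per-session spend cap
    spent: float = 0.0        # running total of authorized spend


def authorize(policy, tool, args, confirm=lambda tool, args: False):
    """Decide whether a proposed tool call may run.

    The decision uses only the session policy and the structured call
    arguments; text returned by earlier tool calls is never consulted.
    """
    if tool in READ_ONLY_TOOLS:
        return True
    if tool not in policy.allowed_tools:
        return False  # default deny, including unknown tools
    recipient = args.get("recipient")
    if recipient is not None and recipient not in policy.allowed_recipients:
        return False
    amount = float(args.get("amount", 0.0))
    if policy.spent + amount > policy.spend_limit:
        return False
    # Irreversible operations additionally need explicit human confirmation.
    if tool in IRREVERSIBLE_TOOLS and not confirm(tool, args):
        return False
    policy.spent += amount
    return True


policy = SessionPolicy(
    allowed_tools={"send_email"},
    allowed_recipients={"alice@example.com"},
)
print(authorize(policy, "send_email", {"recipient": "alice@example.com"}))
print(authorize(policy, "send_email", {"recipient": "attacker@example.net"}))
```

The first call is allowed and the second is denied, because the recipient is not on the allow-list regardless of what any retrieved document told the model to do. The confirm callback defaults to refusal, so irreversible tools stay blocked until a human approval path is wired in.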

Status

Item	Detail
Publication	arXiv:2606.10525 v1, 9 June 2026
Authors	Hofer, Debenedetti, Tramèr (ETH Zurich)
Framework	AgentDojo (extended for white-box access)
Most robust model tested	GPT-5 (best attacks about 4.5–4.7% ASR; transferred GCG <1%)
Most vulnerable surface	Slack-style messaging tasks (~67% ASR on Qwen3-4B)
Nature	Defensive measurement study — no exploit released