PROMPT INJECTION MEDIUM NEW

Automated prompt injection is model-dependent: TAP beats GCG, GPT-5 resists

A June 9, 2026 ETH Zurich study adapts GCG and TAP to AgentDojo across 80 agent task pairs. Black-box TAP beats gradient-based GCG, yet attacks tuned on small models fail to transfer to GPT-5.

2026-06-25 // 6 min
affects: qwen3-4b, gemma3-4b, gpt-5, gpt-5-mini, claude-sonnet-4.5, gemini-2.5-flash, qwen3-235b

What is this?

On June 9, 2026, three ETH Zurich researchers — David Hofer, Edoardo Debenedetti and Florian Tramèr — published Assessing Automated Prompt Injection Attacks in Agentic Environments (arXiv:2606.10525). It is the first systematic measurement of whether the automated attack methods that work for jailbreaking also work for indirect prompt injection (IPI) against tool-calling agents. The short answer: they work, but unevenly. Against small open-weight models the success rates are real; against a frontier model (GPT-5) they collapse, and attacks optimized on small models do not transfer up. Automated injection is a credible threat — but a strongly model-dependent one.

How it works

The team adapted two known jailbreak optimizers to the agentic setting inside AgentDojo, the standard benchmark for agents acting on untrusted data. The white-box method is GCG, which uses gradients to search for an adversarial token string; the black-box method is TAP, which uses an attacker LLM to iteratively rewrite an injection and prune dead ends. No payloads are reproduced here — the contribution is the measurement, not a recipe.

The evaluation spans 80 task pairs across four domains (workspace, banking, travel, slack). The headline numbers, on the small Qwen3-4B target:

Method (Qwen3-4B target)      Attack Success Rate
----------------------------  -------------------
Universal TAP (black-box)     45.2%
Single-task TAP               44.6%
Universal GCG (white-box)     24.1%
Single-task GCG               23.0%

Two structural findings stand out. First, black-box beats white-box: TAP roughly doubles GCG’s success, which the authors attribute to GCG’s optimization instability under a realistic compute budget. Second, the attack’s strength depends on the attacker model — a stronger, less safety-tuned attacker LLM produces better injections, while a safety-tuned attacker sometimes refuses to generate them at all.
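
As a quick arithmetic check on the "roughly doubles" claim, the ratios can be computed directly from the table. This is a minimal sketch: the dictionary layout and variable names are illustrative, and only the four percentages come from the article.

```python
# ASR percentages on the Qwen3-4B target, copied from the table above.
asr = {
    ("TAP", "universal"): 45.2,
    ("TAP", "single-task"): 44.6,
    ("GCG", "universal"): 24.1,
    ("GCG", "single-task"): 23.0,
}

# Ratio of black-box (TAP) to white-box (GCG) success in each setting.
ratios = {
    setting: asr[("TAP", setting)] / asr[("GCG", setting)]
    for setting in ("universal", "single-task")
}

for setting, ratio in ratios.items():
    print(f"{setting}: TAP/GCG = {ratio:.2f}x")
```

Both ratios come out a little under 2 (about 1.88x universal, 1.94x single-task), so "roughly doubles" is a fair summary rather than an exact factor.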

Why it matters

The interesting result is the ceiling, not the floor. On GPT-5, the best attacks reach an attack success rate (ASR) of only about 4.5–4.7%, and GCG strings transferred from Qwen3-4B land below 1%. Universal injections that generalize to held-out task domains on the small model drop to 0% on GPT-5’s held-out domain. In other words, the cheap path (optimize an injection against an open model you control, then fire it at a frontier deployment) largely does not work today.

That is good news with an expiry date. It says model-agnostic, push-button injection is not here yet; it does not say agents are safe. Slack-style tasks were the most vulnerable surface (around 67% ASR on the small model), and even a plain instruction with no optimization scored ~25% there. Anyone running open-weight or smaller models in an agent loop over untrusted content is squarely in the exploitable range the paper measures.
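
For readers reproducing this kind of per-surface breakdown on their own agents, the sketch below shows one way to aggregate ASR by domain. It assumes ASR is simply the fraction of attack attempts in which the injected goal was carried out; the `outcomes` records and the `asr_by_domain` helper are hypothetical and are not data or code from the paper.

```python
from collections import defaultdict

# Hypothetical outcome records: (domain, attack_succeeded).
# Invented for illustration; NOT results from the paper.
outcomes = [
    ("slack", True), ("slack", True), ("slack", False),
    ("banking", False), ("banking", True),
    ("travel", False),
    ("workspace", False),
]


def asr_by_domain(records):
    """Return {domain: fraction of attempts where the attack succeeded}."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for domain, succeeded in records:
        totals[domain] += 1
        wins[domain] += int(succeeded)
    return {domain: wins[domain] / totals[domain] for domain in totals}


for domain, rate in sorted(asr_by_domain(outcomes).items()):
    print(f"{domain}: {rate:.1%}")
```

Breaking the rate out per domain, rather than reporting one pooled number, is what makes a surface like messaging stand out.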

Defenses

The paper’s own finding — frontier robustness plus poor cross-model transfer — is a reason to be deliberate about model choice for agents that read untrusted data, not a reason to relax. The durable mitigations are architectural and predate this work:

  • Treat tool output as data, never as instructions. Keep retrieved content out of the privileged instruction channel; AgentDojo exists precisely to test defenses built on this separation.
  • Authorize the action, not the text. Gate every consequential tool call (send, pay, share, delete) on the user’s original intent, with human confirmation for irreversible operations.
  • Constrain the blast radius. Least-privilege tool scopes, allow-listed recipients and per-session spend/scope limits turn a successful injection into a contained one.
  • Watch the high-risk surfaces first. Messaging and email tools showed the highest susceptibility — prioritize monitoring and guardrails there.
  • Re-test under optimization, not just static prompts. A defense that survives a hand-written injection can still fall to an adaptive, attacker-LLM-driven one; evaluate with automated red-teaming.
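
The authorization and blast-radius items above can be sketched as a small policy gate that sits between the model and its tools. Everything here is an assumption for illustration: the tool names, the `SessionPolicy` fields and the `authorize` function are hypothetical and do not come from AgentDojo or the paper. The key design point is that the policy is built from the user's original request, never from text returned by a tool.

```python
from dataclasses import dataclass, field

# Hypothetical tool names; a real deployment would list its own.
IRREVERSIBLE_TOOLS = {"send_money", "delete_file"}
CONSEQUENTIAL_TOOLS = IRREVERSIBLE_TOOLS | {"send_email", "share_file"}


@dataclass
class SessionPolicy:
    """Per-session limits derived from the user's request, not from tool output."""
    allowed_tools: set
    allowed_recipients: set = field(default_factory=set)
    spend_limit: float = 0.0
    spent: float = 0.0


def authorize(policy, tool, args, user_confirmed=False):
    """Return (allowed, reason) for a tool call proposed by the model."""
    # Least privilege: the tool must be in scope for this session.
    if tool not in policy.allowed_tools:
        return False, "tool not in session scope"
    # Allow-listed recipients for anything that sends or shares.
    if tool in CONSEQUENTIAL_TOOLS:
        recipient = args.get("recipient")
        if recipient is not None and recipient not in policy.allowed_recipients:
            return False, "recipient not allow-listed"
    # Per-session spend cap.
    amount = float(args.get("amount", 0.0))
    if policy.spent + amount > policy.spend_limit:
        return False, "spend limit exceeded"
    # Human confirmation for irreversible operations.
    if tool in IRREVERSIBLE_TOOLS and not user_confirmed:
        return False, "needs human confirmation"
    policy.spent += amount
    return True, "ok"


policy = SessionPolicy(
    allowed_tools={"read_inbox", "send_email", "send_money"},
    allowed_recipients={"alice@example.com"},
    spend_limit=100.0,
)
# An injected instruction tries to mail an unknown address: blocked.
print(authorize(policy, "send_email", {"recipient": "attacker@example.net"}))
# A payment to a known recipient still waits for the user.
print(authorize(policy, "send_money", {"recipient": "alice@example.com", "amount": 50.0}))
```

Because the gate inspects the proposed action rather than the text that triggered it, it behaves the same whether the injection was hand-written or produced by an optimizer.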

Status

Item                      Detail
------------------------  --------------------------------------------------
Publication               arXiv:2606.10525 v1, 9 June 2026
Authors                   Hofer, Debenedetti, Tramèr (ETH Zurich)
Framework                 AgentDojo (extended for white-box access)
Most robust model tested  GPT-5 (~5% ASR; transferred GCG <1%)
Most vulnerable surface   Slack-style messaging tasks (~67% ASR on Qwen3-4B)
Nature                    Defensive measurement study, no exploit released

Sources