RED TEAM MEDIUM NEW

Agentic red teaming: when one operator runs 674 attacks in three hours

A May 2026 paper from Dreadnode wraps the AI red-team toolkit in an agent that picks attacks, runs them, and scores results autonomously — compressing weeks into hours. The real story is what that does to your assessment program.

2026-06-01 // 7 min read // affects: llama-scout

What is this?

On May 5, 2026, researchers Raja Sekhar Rao Dheekonda, Will Pearce and Nick Landers published Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours (arXiv:2605.04019). The paper describes an AI red-teaming agent built on the open-source Dreadnode SDK that turns a natural-language objective into executed adversarial tests, with the human operator describing what to probe rather than implementing how.

The numbers from the case study are the headline. Pointed at Meta’s Llama Scout, the agent ran, per the authors, 674 attacks across roughly 681 assessments and 7,727 trials in about three hours of wall-clock time, reaching an 85% attack success rate “using zero human-developed code.” Help Net Security covered the work on May 21, 2026, adding direct comment from the authors. This is a research and tooling story, not a new attack — every technique the agent uses is already public. What is new is the orchestration layer sitting on top of them, and what that does to the economics of adversarial testing.
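Here is a quick sanity check on the reported scale. It treats "about three hours" as exactly 3.0 hours, which is an assumption, since the paper only gives an approximate figure:

```python
# Figures as reported by the authors; HOURS is an assumed exact value
# standing in for the paper's "about three hours".
TRIALS = 7_727
ATTACKS = 674
HOURS = 3.0

minutes = HOURS * 60
trials_per_minute = TRIALS / minutes          # overall throughput
seconds_per_trial = minutes * 60 / TRIALS     # average spacing between trials
trials_per_attack = TRIALS / ATTACKS          # average trials behind each attack

print(f"{trials_per_minute:.1f} trials/min")
print(f"{seconds_per_trial:.2f} s/trial")
print(f"{trials_per_attack:.1f} trials/attack")
```

That works out to roughly 43 trials a minute, or one every 1.4 seconds or so. Each attack averages about 11.5 trials. No hand-assembled workflow comes close to that pace.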

How it works

The agent wraps a catalogue the paper puts at 45+ adversarial attacks, 450+ transforms, and 130+ scorers. The operator states a goal in plain language through a terminal UI; the agent then handles the loop that a human used to assemble by hand:

Operator: "Probe target X for harmful-content and bias failures."
   |
   v
Agent  -> selects attack strategies (e.g. multi-turn, persona framing,
          graph-/tree-based search over prompts)
       -> composes transforms (encoding, translation to low-resource
          languages, role-play wrappers)
       -> executes against the target
       -> scores each result with an LLM-as-judge
       -> maps findings to OWASP LLM Top 10 / MITRE ATLAS / NIST AI RMF
       -> emits structured findings + compliance tags
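The loop in the diagram can be sketched in a few lines of Python. Every name here is hypothetical: `run_campaign`, the `ATTACKS` and `TRANSFORMS` tables, the `Target` and `Judge` callables, and the single OWASP tag. This is not the Dreadnode SDK's API. The attacks are inert placeholders, and a stub stands in for the LLM judge. The sketch only shows the shape of the loop: select a strategy, compose a transform, execute, score, tag, emit.

```python
import base64
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins throughout: a sketch of the loop in the diagram,
# not the Dreadnode SDK API. Attacks here are inert placeholders.
Target = Callable[[list[str]], str]   # list of user turns -> final reply
Judge = Callable[[str, str], float]   # (objective, reply) -> score in [0, 1]

@dataclass(frozen=True)
class Finding:
    attack: str
    transform: str
    score: float
    success: bool
    tags: tuple[str, ...]

def single_turn(objective: str) -> list[str]:
    return [objective]

def multi_turn(objective: str) -> list[str]:
    # Placeholder for an escalation strategy: an opener, then the objective.
    return ["Let's work through a topic step by step.", objective]

def no_transform(text: str) -> str:
    return text

def to_base64(text: str) -> str:
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

ATTACKS = {"single-turn": single_turn, "multi-turn": multi_turn}
TRANSFORMS = {"none": no_transform, "base64": to_base64}
# Illustrative compliance tag; a real mapping depends on attack and harm category.
TAGS = ("OWASP LLM01: Prompt Injection",)

def run_campaign(objective: str, target: Target, judge: Judge,
                 threshold: float = 0.5) -> list[Finding]:
    findings = []
    for attack_name, build_turns in ATTACKS.items():          # select strategy
        for transform_name, transform in TRANSFORMS.items():  # compose transform
            turns = [transform(t) for t in build_turns(objective)]
            reply = target(turns)                             # execute
            score = judge(objective, reply)                   # score (judge stand-in)
            findings.append(Finding(attack_name, transform_name, score,
                                    score >= threshold, TAGS))  # tag and emit
    return findings

# Usage with stubs: this toy target only "complies" in multi-turn conversations.
def stub_target(turns: list[str]) -> str:
    return "COMPLIED" if len(turns) > 1 else "REFUSED"

def stub_judge(objective: str, reply: str) -> float:
    return 1.0 if reply == "COMPLIED" else 0.0

for f in run_campaign("placeholder objective", stub_target, stub_judge):
    print(f.attack, f.transform, f.success)
```

With these stubs the campaign yields four findings, and only the two multi-turn ones are marked successful. The nested loops try every combination exhaustively. An agent replaces them with a model that selects which attack and transform combinations to try next.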

In the Llama Scout study the authors report that multi-turn techniques such as Crescendo and a search method called Graph of Attacks with Pruning reached a 100% success rate, persona/“skeleton-key” framings also hit 100%, while a simpler Base64 encoding transform landed lower at around 75%. The orchestrating model itself was Moonshot AI’s Kimi 2.5, used as both attacker and judge — a deliberate choice, because highly aligned frontier models often refuse to compose offensive workflows, treating the operator’s legitimate red-team objective as a harmful request.
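Success rates like the 100%, 75% and 85% figures above are ratios, so the denominator matters. The sketch below uses made-up trial records, not the paper's data. It shows how the overall rate depends on the averaging method. Pooling all trials gives a micro average. Averaging the per-technique rates gives a macro average. The two can differ:

```python
from collections import defaultdict

def asr_by_technique(trials):
    """Attack success rate per technique from (technique, succeeded) pairs."""
    counts = defaultdict(lambda: [0, 0])  # technique -> [successes, attempts]
    for technique, succeeded in trials:
        counts[technique][0] += int(succeeded)
        counts[technique][1] += 1
    return {t: s / n for t, (s, n) in counts.items()}

def micro_asr(trials):
    """Pooled rate: every trial counts equally."""
    trials = list(trials)
    return sum(int(ok) for _, ok in trials) / len(trials)

def macro_asr(trials):
    """Unweighted mean of per-technique rates: every technique counts equally."""
    rates = asr_by_technique(trials)
    return sum(rates.values()) / len(rates)

# Toy data for illustration only; not taken from the paper.
TOY_TRIALS = [
    ("crescendo", True), ("crescendo", True),
    ("base64", True), ("base64", True), ("base64", True), ("base64", False),
]

print(asr_by_technique(TOY_TRIALS))  # {'crescendo': 1.0, 'base64': 0.75}
print(round(micro_asr(TOY_TRIALS), 3), round(macro_asr(TOY_TRIALS), 3))  # 0.833 0.875
```

Neither average is wrong. But a headline rate is only comparable across reports when you know which average was computed and what counts as a trial.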

No payloads are reproduced here. The point worth internalising is structural. This is the same shift toward autonomous capability ladders seen elsewhere in agentic tooling: the change comes from orchestration, not invention.

Why it matters

Read the throughput number carefully before you react to it. Several caveats sit in the paper’s own limitations and in the authors’ comments to Help Net Security:

  • The three-hour figure covers a focused slice. The authors note that comprehensive assessments across all attack and harm categories run closer to days.
  • Llama Scout is a mid-size open model (Llama 4 Scout: 17B active parameters in a mixture-of-experts design, released April 2025). An 85% success rate there says little about current frontier systems.
  • The authors confirmed they did not coordinate disclosure with Meta before publishing verbatim outputs, and have not checked whether later Llama Scout checkpoints mitigate the specific combinations found.
  • Humans still win on long-horizon reasoning, contextual social engineering, and novel exploit chains, per co-author Dheekonda.

So the significance is not “AI out-hacks humans.” It is accessibility and scale. Composition work that previously required scripting expertise now runs with far lower overhead, which lowers the floor for defenders and motivated actors alike. As the authors frame it, the meaningful question is no longer whether these techniques exist publicly — they do — but whether defenders can continuously probe their own systems before adversaries do. The companion Dreadnode write-up frames the same run as “232 critical vulnerabilities … in 3 hours, with zero code.”

Defenses

The defensive takeaways are about your program, not a patch.

  1. Adopt continuous assessment, but own the triage. When one operator can run hundreds of attacks in an afternoon, annual or quarterly red-team engagements stop reflecting reality. The scarce skill moves up the stack — from workflow engineering to deciding which of several hundred automated findings is a real risk in your deployment context.

  2. Distrust raw finding counts. A dashboard reporting “232 critical findings” with automatic compliance tags is easy to mistake for security. Build an explicit process for what gets remediated, what is accepted as known risk, and what is a scorer artifact rather than a genuine vulnerability. LLM-as-judge scoring has its own false-positive rate.

  3. Tier results by target realism. A high success rate against a mid-size open model is not evidence about your hardened, frontier-backed production stack. Re-run against the actual model, version, and system-prompt configuration you ship — and date every observation, because behaviour drifts between checkpoints.

  4. Build detection for agentic red-team traffic. Agentic assessment closely resembles agentic attacker activity. Detection tooling for this pattern is still underdeveloped: bursts of transformed prompts, Crescendo-style multi-turn escalation, automated retries. Instrument your own LLM endpoints for it now.

  5. Prioritise the techniques that scored highest. In the case study, multi-turn and persona-framing attacks succeeded more often than encoding tricks. A single target cannot show that this generalises, but it is a sensible order in which to start. Defenses such as input/output classifiers, constrained tool access, and multi-turn-aware guardrails should be evaluated against those families specifically, not just against single-shot payloads.

  6. Mirror the workflow defensively. The same orchestration can run against your own assets continuously — pre-deployment gates, regression tests after model upgrades, coverage mapped to OWASP LLM Top 10 and MITRE ATLAS. Treat agentic red teaming as a blue-team capability, not only a threat.
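The triage process in item 2 can start from a human-reviewed sample. The sketch below estimates the judge's precision and assigns a disposition to each finding. Its names are hypothetical, and its policy is deliberately simple:

- Unconfirmed findings are scorer artifacts.
- Confirmed but out-of-scope findings go to the accepted-risk register for an explicit decision.

That policy is an assumption; your risk owners set the real one.

```python
from dataclasses import dataclass
from enum import Enum

class Disposition(Enum):
    REMEDIATE = "remediate"
    ACCEPTED_RISK = "accepted risk"
    SCORER_ARTIFACT = "scorer artifact"

@dataclass(frozen=True)
class ReviewedFinding:
    finding_id: str
    judge_flagged: bool     # the LLM judge scored the attack as successful
    human_confirmed: bool   # a reviewer agreed the output is a real failure
    in_scope: bool          # the failure is reachable in your deployment

def judge_precision(sample: list[ReviewedFinding]) -> float:
    """Share of judge-flagged findings that a human confirmed."""
    flagged = [f for f in sample if f.judge_flagged]
    if not flagged:
        raise ValueError("sample contains no judge-flagged findings")
    return sum(f.human_confirmed for f in flagged) / len(flagged)

def disposition(f: ReviewedFinding) -> Disposition:
    # Hypothetical policy; adapt to your own risk-acceptance process.
    if not f.human_confirmed:
        return Disposition.SCORER_ARTIFACT
    return Disposition.REMEDIATE if f.in_scope else Disposition.ACCEPTED_RISK

sample = [
    ReviewedFinding("F-1", True, True, True),
    ReviewedFinding("F-2", True, True, False),
    ReviewedFinding("F-3", True, False, True),
    ReviewedFinding("F-4", True, True, True),
]
precision = judge_precision(sample)
print(precision)  # 0.75
print([disposition(f).value for f in sample])
```

Suppose a reviewed sample showed 0.75 precision and that rate held across the board. Then a dashboard of 232 "critical" findings would imply about 174 real failures before scoping. The sample size, not just the ratio, decides how far to trust that estimate.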
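Item 3 advises tying every result to a model, version, system prompt and date, and the data model can enforce that. The minimal sketch below makes two assumptions. It fingerprints the system prompt with SHA-256. It treats anything older than a chosen window as stale; the 30-day default is arbitrary, and all names are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class TargetConfig:
    model: str
    version: str
    system_prompt_sha256: str

def fingerprint(model: str, version: str, system_prompt: str) -> TargetConfig:
    digest = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
    return TargetConfig(model, version, digest)

@dataclass(frozen=True)
class Observation:
    finding_id: str
    target: TargetConfig
    observed_on: date

def tier(obs: Observation, production: TargetConfig, today: date,
         max_age_days: int = 30) -> str:
    if obs.target != production:
        return "proxy"     # evidence about some other model or configuration
    if (today - obs.observed_on).days > max_age_days:
        return "stale"     # right configuration, but re-run before relying on it
    return "current"
```

A finding from a proxy run still tells you what to test next. It just should not be reported as a property of the system you ship.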
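For item 4, a per-session monitor can flag the simplest of the signals listed there. This is a heuristic sketch with hypothetical thresholds. It uses three checks:

- a sliding-window request count for bursts
- a base64 check for encoded prompts
- a counter for repeated prompts

It does not try to detect Crescendo-style escalation, which needs conversation-level scoring. The base64 check will also match some ordinary long alphanumeric tokens.

```python
import base64
import binascii
import re
from collections import deque

_B64 = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")

def looks_base64(text: str) -> bool:
    """Heuristic: the whole prompt is a valid, reasonably long base64 string."""
    s = text.strip()
    if len(s) % 4 or not _B64.fullmatch(s):
        return False
    try:
        base64.b64decode(s, validate=True)
    except binascii.Error:
        return False
    return True

class SessionMonitor:
    """Per-session signals; thresholds are illustrative, not recommendations."""

    def __init__(self, burst_limit: int = 20, window_s: float = 60.0,
                 retry_limit: int = 3):
        self.burst_limit = burst_limit
        self.window_s = window_s
        self.retry_limit = retry_limit
        self.times: deque[float] = deque()
        self.seen: dict[str, int] = {}

    def observe(self, ts: float, prompt: str) -> list[str]:
        signals = []
        # Burst: more than burst_limit requests inside the sliding window.
        self.times.append(ts)
        while self.times and ts - self.times[0] > self.window_s:
            self.times.popleft()
        if len(self.times) > self.burst_limit:
            signals.append("burst")
        # Encoding: the prompt is (probably) base64.
        if looks_base64(prompt):
            signals.append("encoded")
        # Retries: the same whitespace/case-normalised prompt seen repeatedly.
        key = " ".join(prompt.lower().split())
        self.seen[key] = self.seen.get(key, 0) + 1
        if self.seen[key] >= self.retry_limit:
            signals.append("retry")
        return signals
```

Run your own agentic assessment against an instrumented endpoint; it is a cheap way to check that signals like these actually fire.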
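Items 5 and 6 combine naturally into a pre-deployment gate. Compare per-family attack success rates for a candidate model or prompt against the current baseline, and block the release if any family gets worse by more than a tolerance. The sketch below is minimal. The family names, rates and 5-point tolerance are illustrative assumptions. In practice the rates would come from re-running the same assessment against both configurations.

```python
from typing import Optional

def regression_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, tuple[Optional[float], float]]:
    """Return attack families whose success rate rose by more than `tolerance`.

    A family missing from the baseline is also returned, since there is
    nothing to compare it against and it needs a human look.
    """
    regressions = {}
    for family, new_rate in candidate.items():
        old_rate = baseline.get(family)
        if old_rate is None or new_rate - old_rate > tolerance:
            regressions[family] = (old_rate, new_rate)
    return regressions

baseline = {"multi-turn": 0.20, "persona": 0.15, "encoding": 0.10}
candidate = {"multi-turn": 0.30, "persona": 0.16, "encoding": 0.12}

failed = regression_gate(baseline, candidate)
if failed:
    print("BLOCK release:", failed)
else:
    print("pass")
```

In CI, this would run after every model upgrade or system-prompt change. A rising rate on one family is exactly the kind of checkpoint drift that item 3 warns about.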

Status

Item | Reference | Date | Notes
Paper published (arXiv:2605.04019) | arXiv | 2026-05-05 | 39 pages; cs.AI / cs.CR
Dreadnode research write-up | Dreadnode | 2026-05-06 | “232 critical vulnerabilities in 3 hours, zero code”
Press coverage + author comment | Help Net Security | 2026-05-21 | Confirms no coordinated disclosure with Meta
Target model | Meta Llama Scout | released 2025-04 | 17B; 85% ASR in case study
Orchestrating model | Moonshot AI Kimi 2.5 | n/a | Used as attacker + judge to avoid refusal

The right framing is not “an agent broke an LLM in three hours” — that headline is already routine. It is that the operational cost of adversarial testing is collapsing, for both sides, and that defenders who still treat red teaming as an occasional event are about to be out-paced by anyone who treats it as a continuous one.

Sources