FORGE: a multi-agent pipeline turning CVEs into exploits and detections
A June 2, 2026 paper from Dynatrace chains five LLM agents to take a CVE from advisory text to a working exploit attempt and a detection rule, scored on a four-level compromise ladder.
What is this?
On June 2, 2026, Dynatrace researcher Farooq Shaikh published FORGE: Multi-Agent Graduated Exploitation and Detection Engineering (arXiv:2606.03453), accepted at the AgentCy workshop of the ARES 2026 conference. The paper tackles a workflow problem rather than a single vulnerability: the volume of disclosed CVEs now far outstrips the capacity of any organization to assess them, while the three communities that should cooperate — proof-of-concept generation, vulnerability prioritization, and detection-rule engineering — largely work in isolation.
FORGE is an attempt to wire those three steps into one automated pipeline of LLM agents. The interesting part for defenders is not that an agent can write an exploit; it is that the same run is designed to emit a detection rule and a graded, audited verdict about how far the exploit actually got. The full code, per-CVE results, and oracle audit verdicts are released under Apache 2.0 at github.com/dynatrace-oss/forge.
How it works
FORGE runs five specialized agents in a fixed pipeline, each consuming the previous agent’s output:
- Intel — ingests a CVE’s metadata (advisory text, affected components) and extracts the structured facts the later stages need.
- Generator — builds a targeted vulnerable application: a minimal, self-contained app that reproduces the flaw, so the exploitation step runs against a purpose-built lab target rather than anyone’s live system.
- Planner — turns the intel into an exploitation strategy.
- Exploit — conducts a coached, multi-turn attempt against the generated target.
- Detector — engineers a detection rule from what the exploitation produced.
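As a rough illustration of the fixed ordering above, the following Python sketch wires five stub stages into one run. It is not the FORGE implementation: `RunState`, the stage functions, and the placeholder CVE ID are all hypothetical, and the stubs return descriptive strings where the real agents would call an LLM. Whether a stage sees only the immediately preceding output or everything accumulated so far is also an assumption here.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunState:
    """Accumulated outputs of one pipeline run (all field names are hypothetical)."""
    cve_id: str
    advisory: str
    artifacts: dict[str, str] = field(default_factory=dict)
    trace: list[str] = field(default_factory=list)


Stage = Callable[[RunState], str]


# Stub stages: each returns a description of what the real agent would produce.
def intel(state: RunState) -> str:
    return f"structured facts for {state.cve_id}"


def generator(state: RunState) -> str:
    return f"lab target built from [{state.artifacts['Intel']}]"


def planner(state: RunState) -> str:
    return f"strategy derived from [{state.artifacts['Intel']}]"


def exploit(state: RunState) -> str:
    return f"attempt log against [{state.artifacts['Generator']}]"


def detector(state: RunState) -> str:
    return f"draft rule from [{state.artifacts['Exploit']}]"


# Fixed order, matching the list in the article.
PIPELINE: list[tuple[str, Stage]] = [
    ("Intel", intel),
    ("Generator", generator),
    ("Planner", planner),
    ("Exploit", exploit),
    ("Detector", detector),
]


def run_pipeline(cve_id: str, advisory: str) -> RunState:
    """Run every stage once, in order, recording each output by stage name."""
    state = RunState(cve_id=cve_id, advisory=advisory)
    for name, stage in PIPELINE:
        state.artifacts[name] = stage(state)
        state.trace.append(name)
    return state


state = run_pipeline("CVE-0000-0000", "placeholder advisory text")
print(" -> ".join(state.trace))
```

Keeping every stage's output in one state object also leaves a per-run record that can be audited afterwards, which fits the paper's emphasis on released verdicts.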
The novel piece is what the paper calls graduated exploitation: instead of a binary “did it pop or not,” an LLM-primary oracle scores each attempt on a four-level taxonomy, from L0 (no evidence of exploitation) through L3 (full compromise). Grading the depth of success, rather than just success or failure, is what lets the pipeline prioritize: a CVE that an agent can drive to L3 against a faithful reproduction is a very different operational signal from one that stalls at L1. Because exploitation only ever runs against the Generator’s synthetic target, the loop produces evidence and detections without touching production software.
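To make the prioritization argument concrete, here is a minimal sketch of an ordinal grade and a ranking built on it. Only the L0 and L3 descriptions come from the article; L1 and L2 are left as unnamed intermediate levels, and `Grade`, `rank_by_grade`, and the CVE labels are hypothetical names for illustration.

```python
from enum import IntEnum


class Grade(IntEnum):
    """Four-level ladder; only the end points are described in the article."""
    L0 = 0  # no evidence of exploitation
    L1 = 1  # intermediate level, not defined in this article
    L2 = 2  # intermediate level, not defined in this article
    L3 = 3  # full compromise


def rank_by_grade(results: dict[str, Grade]) -> list[str]:
    """Order CVE labels from deepest to shallowest demonstrated compromise.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    return sorted(results, key=lambda cve: (-results[cve], cve))


# Placeholder labels, not real CVE identifiers or real results.
results = {"CVE-A": Grade.L1, "CVE-B": Grade.L3, "CVE-C": Grade.L0}
print(rank_by_grade(results))  # ['CVE-B', 'CVE-A', 'CVE-C']
```

An ordered type rather than a boolean is what allows comparisons such as `grade >= Grade.L2`, so a triage threshold can be moved without changing how the grade is stored.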
Why it matters
FORGE sits in the same emerging category as recent work on agents that hunt or triage vulnerabilities — long-horizon bug-hunting benchmarks, spec-first vulnerability detection, and the sober finding that off-the-shelf agents still lose to deterministic SAST tools. What FORGE adds is the bridge from a vulnerability to a deployable detection, with a graded confidence signal attached.
That bridge cuts both ways. For a SOC, an automated path from “new CVE dropped” to “candidate detection rule plus a reproduction we can test it against” is useful in its own right, and compressing exploit assessment from weeks to hours mirrors what agentic red-teaming work reports. But the same architecture is a template attackers can copy: a pipeline that reads an advisory and graduates toward a working exploit is exactly the industrialization of offense that threat-modeling papers have been forecasting. The open-source release lowers the barrier for both defenders and adversaries. That is the standard, unavoidable tension of published offensive-security tooling, and it is why the evaluation and oracle-audit transparency in the paper matters as much as the exploit results.
Defenses
For teams reading FORGE as defenders rather than tool-builders:
- Treat the detection output as a draft, not a deployable rule. An LLM-generated detection reflects one synthetic reproduction; validate it against real telemetry and tune for false positives before it gates anything. The same caution applies to LLM-as-scanner output.
- Don’t trust the grade blindly. The L0–L3 verdict comes from an LLM-primary oracle; the paper releases its audit verdicts precisely so they can be checked. Use the graded score to prioritize human review, not to replace it.
- Assume attackers have the same pipeline. If a CVE in your stack is the kind FORGE-style agents can drive to L3 against a faithful reproduction, raise its patch priority accordingly: the assessment gap that used to buy you time is shrinking.
- Keep exploitation sandboxed. The design’s safety property is that exploitation only runs against purpose-built targets. Any internal adaptation of this pattern must preserve that isolation and never point the Exploit agent at live or shared infrastructure.
- Feed graded results into prioritization, not just ticket counts. The point of the four-level taxonomy is to separate theoretically-exploitable from demonstrably-exploitable; use that distinction to rank remediation.
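For teams adapting the pattern internally, the sandboxing point above can be enforced in code rather than by convention. The sketch below is one assumed approach, not something the paper describes: a fail-closed allowlist check that runs before any exploitation step. The names `LAB_HOSTS`, `TargetNotAllowed`, and `require_lab_target`, and the `.invalid` hostname, are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist of purpose-built lab hosts; nothing else is reachable.
LAB_HOSTS = frozenset({"target.lab.invalid"})


class TargetNotAllowed(Exception):
    """Raised when a target is not an explicitly allowlisted lab host."""


def require_lab_target(url: str, allowed_hosts: frozenset[str] = LAB_HOSTS) -> str:
    """Return the URL unchanged if its host is allowlisted, otherwise raise.

    Fails closed: a URL whose host cannot be parsed is rejected too.
    """
    host = urlsplit(url).hostname  # None when there is no scheme or authority
    if host is None or host not in allowed_hosts:
        raise TargetNotAllowed(f"refusing non-lab target: {url!r}")
    return url


for candidate in ("http://target.lab.invalid:8080/app", "https://prod.example.com/"):
    try:
        require_lab_target(candidate)
        print("allowed:", candidate)
    except TargetNotAllowed as err:
        print("refused:", err)
```

A hostname allowlist is only one layer. Network-level isolation of the lab environment would still be needed, since a check like this cannot stop a generated target from making outbound connections itself.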
Status
| Item | Reference | Date | Notes |
|---|---|---|---|
| Paper | arXiv:2606.03453 | 2026-06-02 | AgentCy workshop, ARES 2026 |
| Pipeline | Intel → Generator → Planner → Exploit → Detector | 2026-06-02 | Five agents, fixed order |
| Scoring | L0 (no evidence) → L3 (full compromise) | 2026-06-02 | LLM-primary oracle, audited |
| Code | github.com/dynatrace-oss/forge | 2026-06-02 | Apache 2.0, with artifacts repo |
The takeaway: FORGE is less a new attack than a demonstration that the CVE-to-exploit-to-detection loop can be automated end-to-end with current LLM agents — and graded for confidence along the way. For defenders, the practical value is faster, prioritized detection engineering; the matching risk is that the same blueprint compresses the offensive timeline just as much, so patch-prioritization assumptions built on slow exploit development deserve a second look.