Cyber Defense Benchmark: frontier LLMs flunk open-ended threat hunting
An April 2026 benchmark drops five frontier models into raw Windows logs and asks them to hunt. The best finds 3.8% of malicious events — none clears the bar for unsupervised SOC work.
What is this?
A recurring pitch in security tooling is the autonomous SOC analyst: point an LLM agent at your logs and let it hunt. A new benchmark tests that pitch directly — and the result is a clean miss.
On April 21, 2026 (last revised April 23, 2026), Alankrit Chona, Igor Kozlov and Ambuj Kumar posted Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps (arXiv:2604.19533). It measures how well LLM agents perform the core SOC task of threat hunting: given a database of raw Windows event logs with no guided questions and no hints, find the exact timestamps of the malicious events.
This is harder than the curated multiple-choice security quizzes that LLMs already score well on. There is no question to answer — only a haystack and an instruction to find the needles. Across five frontier models, every one failed badly.
How it works
The benchmark wraps 106 real attack procedures from the open-source OTRF Security-Datasets corpus — spanning 86 MITRE ATT&CK sub-techniques across 12 tactics — into a reinforcement-learning environment.
Each episode works like this, per the paper:
1. A deterministic campaign simulator replays a real attack, time-shifting timestamps and obfuscating entity names so the agent cannot memorize the public recording.
2. The agent gets an in-memory SQLite database of 75,000-135,000 log records (mostly benign background noise).
3. The agent iteratively submits SQL queries to investigate, then explicitly flags the timestamps it believes are malicious.
4. Flags are scored CTF-style against ground truth derived from Sigma detection rules.
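The four steps above can be sketched in a few lines of Python. This is a toy illustration, not the benchmark's code: the `logs` schema, the `RECORDING` data, the `HOST-` alias format, and the function names `randomize` and `run_episode` are all assumptions made for the example, and a single hard-coded query stands in for the agent's iterative investigation.

```python
import random
import sqlite3
from datetime import datetime, timedelta

# Toy "recording": (timestamp, host, process, is_malicious). All values are made up.
RECORDING = [
    ("2020-09-01T10:00:00", "WS01", "explorer.exe", False),
    ("2020-09-01T10:00:05", "WS01", "powershell.exe", True),
    ("2020-09-01T10:00:09", "WS01", "svchost.exe", False),
    ("2020-09-01T10:00:12", "WS01", "rundll32.exe", True),
]


def randomize(recording, seed):
    """Step 1 (assumed form): one time offset plus host aliases, fixed by the seed."""
    rng = random.Random(seed)
    offset = timedelta(days=rng.randint(30, 900), seconds=rng.randint(0, 86399))
    aliases = {}
    shifted = []
    for ts, host, process, is_malicious in recording:
        if host not in aliases:
            aliases[host] = f"HOST-{rng.randint(1000, 9999)}"
        new_ts = (datetime.fromisoformat(ts) + offset).isoformat()
        shifted.append((new_ts, aliases[host], process, is_malicious))
    return shifted


def run_episode(seed):
    events = randomize(RECORDING, seed)
    truth = {ts for ts, _, _, is_malicious in events if is_malicious}

    # Step 2: the agent sees only the database, never the labels.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE logs (ts TEXT, host TEXT, process TEXT)")
    db.executemany("INSERT INTO logs VALUES (?, ?, ?)", [e[:3] for e in events])

    # Step 3: a stand-in for the agent's SQL investigation and its flags.
    rows = db.execute(
        "SELECT ts FROM logs WHERE process = ? ORDER BY ts", ("powershell.exe",)
    ).fetchall()
    flagged = {ts for (ts,) in rows}
    db.close()

    # Step 4: score flags against ground truth as (correct flags, total malicious).
    return len(flagged & truth), len(truth)


print(run_episode(seed=7))
```

Because the offset and aliases come from a seeded generator, the same seed always yields the same episode, while a different seed changes every timestamp and host name the agent sees.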
The use of Sigma rules — a vendor-agnostic detection format mapped to ATT&CK — as ground truth means the agent is graded against what a competent human detection engineer would actually flag, not a synthetic key.
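To make "ground truth derived from Sigma rules" concrete, here is a minimal sketch of how a rule-like selection could label events. It is deliberately simplified: real Sigma rules are YAML documents with conditions and many more modifiers, and the paper does not describe its labelling pipeline at this level. The `RULE` dictionary, the event fields, and the helper names `matches` and `ground_truth` are hypothetical; only the `endswith` and `contains` modifiers and the case-insensitive matching mirror Sigma's documented behaviour.

```python
# Simplified stand-in for a Sigma "selection": "Field|modifier" -> value.
# Hypothetical rule for illustration only; not taken from the Sigma repository.
RULE = {
    "title": "Hypothetical encoded PowerShell launch",
    "selection": {
        "Image|endswith": "\\powershell.exe",
        "CommandLine|contains": "-enc",
    },
}

PS_PATH = "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe"

# Made-up events; the field names follow common Sysmon process-creation fields.
EVENTS = [
    {"ts": "2026-01-05T09:00:01", "Image": "C:\\Windows\\explorer.exe",
     "CommandLine": "explorer.exe"},
    {"ts": "2026-01-05T09:00:07", "Image": PS_PATH,
     "CommandLine": "powershell.exe -enc SQBFAFgA"},
    {"ts": "2026-01-05T09:00:15", "Image": PS_PATH,
     "CommandLine": "powershell.exe Get-Date"},
]


def matches(event, selection):
    """Return True if every field condition in the selection holds (logical AND)."""
    for key, expected in selection.items():
        field, _, modifier = key.partition("|")
        actual = str(event.get(field, "")).lower()
        wanted = expected.lower()  # Sigma string matching is case-insensitive by default
        if modifier == "endswith":
            ok = actual.endswith(wanted)
        elif modifier == "contains":
            ok = wanted in actual
        elif modifier == "":
            ok = actual == wanted
        else:
            raise ValueError(f"modifier not supported in this sketch: {modifier}")
        if not ok:
            return False
    return True


def ground_truth(events, rule):
    """Timestamps of events the rule fires on; these become the flags to find."""
    return {e["ts"] for e in events if matches(e, rule["selection"])}


print(ground_truth(EVENTS, RULE))
```

Only the second event satisfies both conditions, so its timestamp is the single flag an agent would need to submit for this toy rule.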
The models tested were Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5 and Gemini 3 Flash, run on 26 campaigns covering 105 of the 106 procedures.
Why it matters
The headline numbers are stark. The best model, Claude Opus 4.6, submitted correct flags for only 3.8% of malicious events on average. No run by any model ever found all the flags in an episode.
The authors define a sensible deployment bar: ≥50% recall on every ATT&CK tactic — the minimum you would want before letting an agent hunt unsupervised. No model passed. The leader cleared the bar on just 5 of 13 tactics; the other four models cleared it on zero.
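The deployment bar is easy to state in code. The sketch below assumes each ground-truth timestamp maps to exactly one tactic and that recall is computed over a single pooled set of flags; how the paper aggregates across runs and campaigns is not specified here, so treat both as assumptions. The names `recall_by_tactic`, `cleared_tactics`, and `passes_bar` are hypothetical.

```python
from collections import defaultdict

PASS_BAR = 0.5  # the article's bar: at least 50% recall on every tactic


def recall_by_tactic(truth, flagged):
    """truth maps timestamp -> tactic; flagged is the set of submitted timestamps."""
    total = defaultdict(int)
    found = defaultdict(int)
    for ts, tactic in truth.items():
        total[tactic] += 1
        if ts in flagged:
            found[tactic] += 1
    return {tactic: found[tactic] / total[tactic] for tactic in total}


def cleared_tactics(recalls, bar=PASS_BAR):
    """Tactics whose recall meets the bar."""
    return sorted(t for t, r in recalls.items() if r >= bar)


def passes_bar(recalls, bar=PASS_BAR):
    """Pass only if every tactic meets the bar."""
    return all(r >= bar for r in recalls.values())


# Made-up example data.
truth = {
    "t1": "execution", "t2": "execution",
    "t3": "persistence", "t4": "persistence",
    "t5": "discovery",
}
flagged = {"t1", "t2", "t3", "t9"}  # "t9" is a false positive and does not affect recall

recalls = recall_by_tactic(truth, flagged)
print(recalls, cleared_tactics(recalls), passes_bar(recalls))
```

In this toy case two of three tactics clear the bar, yet the agent still fails overall because one tactic falls below 50%. That is the same shape as the leader's result: clearing some tactics is not enough.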
The gap that matters is between this and the polished benchmarks vendors quote. LLMs look strong on curated, hint-rich security Q&A. Drop the same models into an open-ended, evidence-driven hunt over noisy logs and the performance collapses. The skill being measured — patient, iterative pivoting through a large corpus to assemble weak signals into a confirmed finding — is exactly what a SOC analyst does, and exactly what curated benchmarks fail to capture.
For anyone weighing an “AI threat hunting” product, this is a concrete reason to demand evaluation on open-ended tasks, not leaderboard trivia.
Defenses
This is a defensive-readiness finding, so the “defense” is how to deploy LLMs in a SOC without overtrusting them.
- Don’t run autonomous hunting unsupervised. On this evidence, an LLM agent left to find malicious events on its own will miss the large majority of them. Keep a human analyst in the loop for any hunt that gates a response.
- Use LLMs where they’re actually strong. Summarizing an alert, drafting a query, explaining a Sigma rule, or triaging an already-detected event are narrow, bounded tasks, very different from open-ended discovery. Scope the tool to those.
- Benchmark on your own open-ended tasks. Vendor accuracy on curated Q&A tells you little about hunting. Replay real attack data (the OTRF corpus is public) and measure recall per ATT&CK tactic before trusting any agent.
- Treat recall, not precision, as the safety metric. A hunter that misses roughly 96% of malicious events, as the best model here did, is dangerous even if everything it does flag is correct. Measure what it failed to find.
- Layer deterministic detection underneath. Sigma rules and signature-based detection caught these events by construction. LLM agents should sit on top of reliable detection engineering, not replace it.
These points reinforce a broader caution: benchmarking security agents is hard, and a single headline number hides the operating point you’ll actually run at.
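The recall-versus-precision point can be checked with simple arithmetic. The numbers below are invented to mirror the headline figure (19 correct flags out of 500 malicious events is 3.8%); they are not taken from the paper, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(truth, flagged):
    """Precision: share of flags that are correct. Recall: share of truth that was found."""
    true_positives = len(truth & flagged)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Invented data: 500 malicious events, and an agent that flags 19 of them and nothing else.
truth = {f"evt-{i}" for i in range(500)}
flagged = {f"evt-{i}" for i in range(19)}

precision, recall = precision_recall(truth, flagged)
missed = len(truth - flagged)
print(f"precision={precision:.1%} recall={recall:.1%} missed={missed}")
```

A precision-only report would call this agent perfect, since every flag is correct. The recall figure and the count of missed events show what that report hides.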
Status
| Item | Reference | Date | Notes |
|---|---|---|---|
| Cyber Defense Benchmark | arXiv:2604.19533 | 2026-04-21 (v1) → 2026-04-23 (v3) | 106 procedures, 86 ATT&CK sub-techniques, 12 tactics |
| Best result | Claude Opus 4.6 | 2026 | 3.8% of malicious events flagged; passes 5/13 tactics |
| Other models | GPT-5, Gemini 3.1 Pro, Kimi K2.5, Gemini 3 Flash | 2026 | Clear the deployment bar on zero tactics |
| Ground truth | OTRF Security-Datasets + Sigma rules | ongoing | Public corpus; results reproducible |
The takeaway is not that LLMs are useless in a SOC — it’s that open-ended threat hunting is not yet a task you can hand off. Measure for it before you trust it.