Cyber Defense Benchmark: frontier LLMs flunk open-ended threat hunting

An April 2026 benchmark drops five frontier models into raw Windows logs and asks them to hunt. The best finds 3.8% of malicious events — none clears the bar for unsupervised SOC work.

2026-06-15 // 6 min affects: claude-opus-4.6, gpt-5, gemini-3.1-pro, kimi-k2.5, gemini-3-flash

What is this?

A recurring pitch in security tooling is the autonomous SOC analyst: point an LLM agent at your logs and let it hunt. A new benchmark tests that pitch directly — and the result is a clean miss.

On April 21, 2026 (last revised April 23, 2026), Alankrit Chona, Igor Kozlov and Ambuj Kumar posted Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps (arXiv:2604.19533). It measures how well LLM agents perform the core SOC task of threat hunting: given a database of raw Windows event logs with no guided questions and no hints, find the exact timestamps of the malicious events.

This is harder than the curated multiple-choice security quizzes that LLMs already score well on. There is no question to answer — only a haystack and an instruction to find the needles. Across five frontier models, every one failed badly.

How it works

The benchmark wraps 106 real attack procedures from the open-source OTRF Security-Datasets corpus — spanning 86 MITRE ATT&CK sub-techniques across 12 tactics — into a reinforcement-learning environment.

Each episode works like this, per the paper:

1. A deterministic campaign simulator replays a real attack,
   time-shifting timestamps and obfuscating entity names so the
   agent cannot memorize the public recording.
2. The agent gets an in-memory SQLite database of
   75,000-135,000 log records (mostly benign background noise).
3. The agent iteratively submits SQL queries to investigate,
   then explicitly flags the timestamps it believes are malicious.
4. Flags are scored CTF-style against ground truth derived
   from Sigma detection rules.
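
The episode loop above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's code: the table schema, the toy events, the `toy_agent` stand-in for the LLM, and the fields returned by `score` are all assumptions made for the example.

```python
import sqlite3

# Hypothetical schema and toy data; the benchmark's real schema is not described here.
EVENTS = [
    ("2026-01-01T10:00:00", "host-a", 4624, "user logon"),
    ("2026-01-01T10:00:05", "host-a", 4688, "powershell.exe -enc SQBFAFgA"),
    ("2026-01-01T10:00:09", "host-b", 4688, "notepad.exe"),
    ("2026-01-01T10:00:12", "host-b", 4688, "rundll32.exe comsvcs.dll MiniDump"),
]
# Timestamps a detection rule would mark as malicious (assumed for this toy case).
GROUND_TRUTH = {"2026-01-01T10:00:05", "2026-01-01T10:00:12"}


def build_db(events):
    """Load the log records into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (ts TEXT, host TEXT, event_id INTEGER, message TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", events)
    return conn


def toy_agent(conn):
    """Stand-in for the LLM agent: run one fixed SQL query and flag every hit."""
    rows = conn.execute(
        "SELECT ts FROM events WHERE event_id = 4688 AND message LIKE '%-enc%'"
    ).fetchall()
    return {ts for (ts,) in rows}


def score(flags, truth):
    """Compare flagged timestamps with ground truth."""
    correct = flags & truth
    return {
        "correct": len(correct),
        "wrong": len(flags - truth),
        "recall": len(correct) / len(truth),
    }


conn = build_db(EVENTS)
flags = toy_agent(conn)
result = score(flags, GROUND_TRUTH)
print(result)
```

Even in this four-row toy, a single plausible query finds one malicious event and misses the other. A real agent has to keep pivoting through tens of thousands of rows to close that gap.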

The use of Sigma rules — a vendor-agnostic detection format mapped to ATT&CK — as ground truth means the agent is graded against what a competent human detection engineer would actually flag, not a synthetic key.
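
To make the idea of rule-derived ground truth concrete, here is a deliberately simplified sketch. It is not a Sigma engine and not the paper's labelling pipeline: real Sigma rules are YAML documents with richer conditions and modifiers, while the `RULE` dictionary, the `matches` helper, and the event fields below are hypothetical and support only a "field contains any of these substrings" check.

```python
# Simplified stand-in for a Sigma rule: every listed field must contain
# at least one of its substrings (case-insensitive). Illustrative only.
RULE = {
    "title": "Encoded PowerShell command line (illustrative)",
    "selection": {
        "Image": ["powershell.exe"],
        "CommandLine": ["-enc", "-EncodedCommand"],
    },
}

# Toy process-creation events; field names mimic common Windows log fields.
LOG = [
    {"ts": "2026-01-01T10:00:05",
     "Image": r"C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe",
     "CommandLine": "powershell.exe -enc SQBFAFgA"},
    {"ts": "2026-01-01T10:00:09",
     "Image": r"C:\Windows\System32\notepad.exe",
     "CommandLine": "notepad.exe report.txt"},
    {"ts": "2026-01-01T10:00:11",
     "Image": r"C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe",
     "CommandLine": "powershell.exe Get-Date"},
]


def matches(rule, event):
    """Return True when every field in the rule's selection matches the event."""
    for field, needles in rule["selection"].items():
        value = str(event.get(field, "")).lower()
        if not any(needle.lower() in value for needle in needles):
            return False
    return True


def ground_truth(rules, events):
    """Timestamps of events matched by at least one rule."""
    return {e["ts"] for e in events if any(matches(r, e) for r in rules)}


truth = ground_truth([RULE], LOG)
print(truth)
```

Only the encoded PowerShell invocation is labelled. The benign `Get-Date` call shares the same binary but fails the command-line check, which is the kind of distinction the agent must reproduce without seeing the rule.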

The models tested were Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5 and Gemini 3 Flash, run on 26 campaigns covering 105 of the 106 procedures.

Why it matters

The headline numbers are stark. The best model, Claude Opus 4.6, submitted correct flags for only 3.8% of malicious events on average. No run by any model ever found all the flags in an episode.

The authors define a sensible deployment bar: ≥50% recall on every ATT&CK tactic — the minimum you would want before letting an agent hunt unsupervised. No model passed. The leader cleared the bar on just 5 of 13 tactics; the other four models cleared it on zero.
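
The deployment bar can be expressed as a small check. The sketch below assumes recall is pooled per tactic over a set of labelled timestamps; how the paper aggregates across episodes is not specified here, and the function names and sample data are hypothetical.

```python
from collections import defaultdict

PASS_THRESHOLD = 0.5  # the >=50% recall bar described above


def recall_by_tactic(truth, flags):
    """truth maps each malicious timestamp to its ATT&CK tactic; flags is a set."""
    total = defaultdict(int)
    found = defaultdict(int)
    for ts, tactic in truth.items():
        total[tactic] += 1
        if ts in flags:
            found[tactic] += 1
    return {tactic: found[tactic] / total[tactic] for tactic in total}


def passes_bar(recalls, threshold=PASS_THRESHOLD):
    """The bar requires every tactic to reach the threshold, not the average."""
    return all(r >= threshold for r in recalls.values())


# Toy labels: five malicious events across three tactics.
truth = {
    "t1": "execution",
    "t2": "execution",
    "t3": "credential-access",
    "t4": "credential-access",
    "t5": "lateral-movement",
}
flags = {"t1", "t2", "t3", "t9"}  # "t9" is a wrong flag and does not affect recall

recalls = recall_by_tactic(truth, flags)
cleared = sorted(t for t, r in recalls.items() if r >= PASS_THRESHOLD)
print(recalls, cleared, passes_bar(recalls))
```

In this toy case the agent clears two of three tactics yet still fails overall, because a single tactic with zero recall is enough to miss the bar.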

The gap that matters is between this and the polished benchmarks vendors quote. LLMs look strong on curated, hint-rich security Q&A. Drop the same models into an open-ended, evidence-driven hunt over noisy logs and the performance collapses. The skill being measured — patient, iterative pivoting through a large corpus to assemble weak signals into a confirmed finding — is exactly what a SOC analyst does, and exactly what curated benchmarks fail to capture.

For anyone weighing an “AI threat hunting” product, this is a concrete reason to demand evaluation on open-ended tasks, not leaderboard trivia.

Defenses

This is a defensive-readiness finding, so the “defense” is how to deploy LLMs in a SOC without overtrusting them.

Don’t run autonomous hunting unsupervised. On this evidence, an LLM agent left to find malicious events on its own will miss the large majority of them. Keep a human analyst in the loop for any hunt that gates a response.
Use LLMs where they’re actually strong. Summarizing an alert, drafting a query, explaining a Sigma rule, and triaging an already-detected event are narrow, bounded tasks, very different from open-ended discovery. Scope the tool to those.
Benchmark on your own open-ended tasks. Vendor accuracy on curated Q&A tells you little about hunting. Replay real attack data (the OTRF corpus is public) and measure recall per ATT&CK tactic before trusting any agent.
Treat recall, not precision, as the safety metric. A hunter that misses 96% of events is dangerous even if everything it does flag is correct. Measure what it failed to find.
Layer deterministic detection underneath. The ground truth here is derived from Sigma rules, so rule-based detection flags every scored event by construction. LLM agents should sit on top of reliable detection engineering, not replace it.
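
A short worked example shows why recall is the number to watch and what layering changes. The counts are toy values chosen to mirror the 3.8% headline figure; the rule coverage is purely hypothetical, and the helper name is an assumption.

```python
def precision_recall(flags, truth):
    """Precision: share of flags that are correct. Recall: share of truth found."""
    true_positives = len(flags & truth)
    precision = true_positives / len(flags) if flags else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy setup: 1000 malicious event ids, of which the agent flags 38, all correct.
truth = set(range(1000))
llm_flags = set(range(38))
llm_precision, llm_recall = precision_recall(llm_flags, truth)

# Hypothetical deterministic rules covering a different slice of the events.
rule_flags = set(range(100, 1000))
layered_precision, layered_recall = precision_recall(llm_flags | rule_flags, truth)

print(f"LLM alone: precision={llm_precision:.3f} recall={llm_recall:.3f}")
print(f"Layered:   precision={layered_precision:.3f} recall={layered_recall:.3f}")
```

The agent alone reports perfect precision while missing more than 96% of the events. Taking the union with rule-based detections is what moves recall in this sketch, which is the point of keeping deterministic detection underneath.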

These points reinforce a broader caution: benchmarking security agents is hard, and a single headline number hides the operating point you’ll actually run at.

Status

Item	Reference	Date	Notes
Cyber Defense Benchmark	arXiv:2604.19533	2026-04-21 (v1) → 2026-04-23 (v3)	106 procedures, 86 ATT&CK sub-techniques, 12 tactics
Best result	Claude Opus 4.6	2026	3.8% of malicious events flagged; passes 5/13 tactics
Other models	GPT-5, Gemini 3.1 Pro, Kimi K2.5, Gemini 3 Flash	2026	Clear the deployment bar on zero tactics
Ground truth	OTRF Security-Datasets + Sigma rules	ongoing	Public corpus; results reproducible

The takeaway is not that LLMs are useless in a SOC — it’s that open-ended threat hunting is not yet a task you can hand off. Measure for it before you trust it.