RESEARCH LOW NEW

AuditBench: LLMs investigating real attacks are false-positive machines

A June 2026 benchmark tests five frontier LLMs on real audit-log investigations. Verdict: overly suspicious models, many false positives — and smaller models often match the big ones.

2026-06-11 // 6 min // affects: gpt-5, gemini-2.5, llama-4

What is this?

Security vendors increasingly pitch LLMs as tireless SOC analysts that can read logs and find intrusions. AuditBench, a benchmark paper submitted to arXiv on June 9, 2026 (arXiv:2606.10281, Anand, Hou, Fields, Kantchelian, Tao, Thomas and Ho), is one of the first systematic attempts to measure whether that claim survives contact with real audit logs. The authors built a labeled dataset of Linux and Windows system logs spanning more than 50 investigation scenarios — both malicious and benign — and graded five frontier LLMs on four tasks incident-response teams actually perform: alert triage (attack classification), and identifying lateral movement, persistence mechanisms, and data exfiltration.

How it works

The benchmark combines two data sources. A lab portion contains 25 scenarios executed on virtual machines, with attack techniques drawn from MITRE ATT&CK and implemented via the Atomic Red Team framework. A second portion derives 16 attack scenarios from DARPA’s OpTC dataset. Every scenario carries manually verified ground-truth labels, so the evaluation pipeline can compute true-positive rate, false-positive rate, and F1 per task.
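The three metrics can be computed from ground-truth labels and model verdicts as in the minimal sketch below. It is not the paper's evaluation pipeline: the `score` function, the `Metrics` type, and the sample labels are all made up for illustration, and each scenario is assumed to reduce to one malicious/benign verdict.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    tpr: float  # true-positive rate (recall on malicious scenarios)
    fpr: float  # false-positive rate (benign scenarios flagged as malicious)
    f1: float   # harmonic mean of precision and recall


def score(truth: list[bool], predicted: list[bool]) -> Metrics:
    """Compare verdicts to labels; True means 'malicious' in both lists."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return Metrics(tpr, fpr, f1)


# Hypothetical labels: two malicious scenarios, three benign ones.
truth = [True, True, False, False, False]
# An overly suspicious model: catches both attacks but flags two benign cases.
predicted = [True, True, True, True, False]

m = score(truth, predicted)
print(f"TPR={m.tpr:.2f} FPR={m.fpr:.2f} F1={m.f1:.2f}")
```

In this made-up case the model never misses an attack, yet two of its four alerts are false, which is the error profile the article describes.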

Two design choices matter for practitioners. First, logs are fed to the model in two representations: raw output from native collectors like Linux auditd, or a pre-processed edge representation built from a provenance graph. Second, the evaluated models split into larger (GPT-5, Gemini 2.5 Pro) and smaller (GPT-5 mini, Gemini 2.5 Flash, Llama 4 Maverick) tiers, allowing a direct test of whether scale buys investigation skill.
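To make the second representation concrete, here is a rough sketch of collapsing events into provenance-style edges before prompting. The article does not specify AuditBench's actual edge format, so the `Edge` fields, the `to_edges` and `render` helpers, and the simplified event dictionaries (which are not real auditd records) are assumptions for illustration only.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    subject: str  # acting process
    action: str   # e.g. "exec", "connect", "write"
    obj: str      # process, file, or socket acted upon


def to_edges(events: list[dict]) -> list[Edge]:
    """Turn events into deduplicated edges, keeping first-seen order."""
    seen: set[Edge] = set()
    edges: list[Edge] = []
    for ev in events:
        edge = Edge(ev["process"], ev["action"], ev["target"])
        if edge not in seen:
            seen.add(edge)
            edges.append(edge)
    return edges


def render(edges: list[Edge]) -> str:
    """Format edges as compact lines suitable for a model prompt."""
    return "\n".join(f"{e.subject} -[{e.action}]-> {e.obj}" for e in edges)


# Hypothetical, already-parsed events (not a real collector format).
events = [
    {"process": "bash", "action": "exec", "target": "curl"},
    {"process": "curl", "action": "connect", "target": "203.0.113.7:443"},
    {"process": "curl", "action": "connect", "target": "203.0.113.7:443"},
    {"process": "curl", "action": "write", "target": "/tmp/payload"},
]

edges = to_edges(events)
print(render(edges))
```

The point of such a step is that repeated, verbose records shrink to a short list of who-did-what-to-what relations, which leaves more of the model's context for the activity that matters.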

Why it matters

The headline result is sobering: performance is mixed across all tasks, with a strong tendency toward overly suspicious verdicts. Models flag benign activity as malicious, producing the very false-positive flood that already drowns SOC teams. An assistant that amplifies alert fatigue is not neutral: attackers can hide real activity in a flood of false alarms, and recent work on oversight capacity shows reviewers degrade fast under volume.

Second, no model strictly dominates, and, contrary to scaling expectations, the best small model frequently matches or beats the best large model. On the edge representation, smaller models reached F1 1.00 vs 0.77 for larger ones on classification, and 0.80 vs 0.57 on persistence; larger models stayed ahead only on exfiltration (0.77 vs 0.56). Data representation and prompt construction moved results as much as model choice did.

Third, the paper grades explanation quality using a textual-entailment (NLI) judge: even when the verdict is right, the model’s stated reasoning is not always supported by the logs — a real risk when analysts paste LLM narratives into incident reports.

Defenses

For teams deploying LLM-assisted investigation, AuditBench’s findings translate into concrete guardrails:

Treat LLM verdicts as triage suggestions, never as dispositive findings. Keep a human analyst as the authority on closure decisions, and budget for the false-positive bias.
Invest in data representation before model size. A provenance-graph/edge preprocessing step moved results as much as model choice did, and it lets cheaper models do the job.
Verify explanations, not just verdicts. Require that any LLM-cited evidence (process names, log lines, timestamps) be mechanically checked against the actual logs before it enters a report.
Benchmark on your own telemetry. Per-task error profiles vary widely; a model strong on persistence may be weak on exfiltration. Build a small labeled scenario set (Atomic Red Team makes this cheap) and measure before trusting.
Harden the log pipeline itself. An LLM that reads logs is also a prompt-injection target — attacker-controlled strings in process arguments or filenames travel straight into the model’s context.
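The "verify explanations" guardrail above can be partly automated. The sketch below assumes the model is asked to quote the log lines it relies on, and checks that each quote really occurs in the logs. The `check_citations` helper, the whitespace normalization, and the sample lines are all hypothetical. Exact-match checking only proves a quoted line exists, not that the reasoning built on it is sound.

```python
def normalize(line: str) -> str:
    """Collapse runs of whitespace so spacing differences do not cause misses."""
    return " ".join(line.split())


def check_citations(cited: list[str], log_lines: list[str]) -> dict[str, bool]:
    """Map each cited line to True if it occurs in the logs (modulo whitespace)."""
    known = {normalize(line) for line in log_lines}
    return {quote: normalize(quote) in known for quote in cited}


# Made-up log lines and model citations, for illustration only.
logs = [
    "10:02:11 pid=4121 exe=/usr/bin/curl dst=203.0.113.7:443",
    "10:02:15 pid=4121 exe=/usr/bin/tar path=/tmp/out.tgz",
]
cited = [
    "10:02:11  pid=4121 exe=/usr/bin/curl dst=203.0.113.7:443",  # present, extra space
    "10:03:40 pid=4121 exe=/usr/bin/scp dst=203.0.113.7:22",     # not in the logs
]

result = check_citations(cited, logs)
unsupported = [quote for quote, ok in result.items() if not ok]
print(f"{len(unsupported)} of {len(cited)} cited lines not found in logs")
```

Any quote that fails the check can be stripped from the draft report or routed to an analyst before it is treated as evidence.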

Status

Item	Detail
Paper	arXiv:2606.10281, submitted June 9, 2026
Scope	4 IR tasks, 50+ scenarios, Linux + Windows
Models tested	GPT-5, Gemini 2.5 Pro, GPT-5 mini, Gemini 2.5 Flash, Llama 4 Maverick
Key finding	Mixed performance, false-positive prone; small models competitive
Prior art	ExCyTIn-Bench (arXiv:2507.14201) for threat-investigation agents