RESEARCH LOW NEW

SEC-bench Pro: how well can AI agents really hunt bugs in V8 and SpiderMonkey?

A May 26, 2026 benchmark measures coding agents on long-horizon vulnerability discovery in real browser engines. Every single-agent frontier configuration stays below 40% on both engines, and the gap matters for both attackers and defenders.

2026-06-15 // 6 min affects: coding-agents, claude-code, openai-codex, llm-security-agents

What is this?

“SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?” (arXiv:2605.26548, 26 May 2026) is a benchmark that asks a question every security team now has to answer: when you point a coding agent at a real, large codebase and tell it to find a vulnerability, how often does it actually succeed? The answer, on the two targets the authors chose — Google’s V8 and Mozilla’s SpiderMonkey JavaScript engines — is “less often than the marketing implies.”

It extends the original SEC-bench (NeurIPS 2025) from short, well-scoped tasks to long-horizon ones: multi-step bug hunting on browser-grade software, where the agent has to navigate a million-line codebase, form a hypothesis, build a proof-of-concept, and confirm the crash — with no fuzzing harness handed to it and no description pointing at the bug. That realism is the whole point. The authors argue that earlier benchmarks overstate model ability because they lean on target-specific hints or simple reproduction tasks.

How it works

SEC-bench Pro is instantiated with 183 validated vulnerabilities spanning memory-safety, sandbox-escape, JIT, and race-condition bugs — categories that map to how browser engines actually get broken. The V8 subset alone represents more than $1.5 million in cumulative Google Vulnerability Reward Program payouts, so these are not toy bugs: they are issues real researchers were paid handsomely to find.

Each task runs under browser-grade or runtime-grade execution conditions, and the agent is scored on whether it can discover and reproduce the issue end to end. Crucially, the harness measures the long-horizon workflow rather than a single retrieval or patch step, which is where current agents tend to fall apart.
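To make "reproduce the issue end to end" concrete, here is a minimal sketch of one possible crash check. It is not the paper's scorer, whose details are not given here. The function name `reproduces_crash`, the timeout, and the success criterion (death by signal on POSIX, or an AddressSanitizer report on stderr) are all assumptions for illustration.

```python
import subprocess
import sys
from typing import Sequence


def reproduces_crash(cmd: Sequence[str], timeout_s: float = 60.0) -> bool:
    """Return True if the command dies from a signal or prints an ASan report.

    Hypothetical success criterion for illustration only; assumes POSIX,
    where a negative return code means the process was killed by a signal.
    """
    try:
        proc = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            errors="replace",  # engine output may not be valid UTF-8
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # in this sketch a hang does not count as a crash
    killed_by_signal = proc.returncode < 0
    sanitizer_report = "ERROR: AddressSanitizer" in proc.stderr
    return killed_by_signal or sanitizer_report


if __name__ == "__main__":
    # Stand-in for an engine shell plus proof-of-concept file: a child
    # Python process that aborts itself, so the sketch runs anywhere POSIX.
    stand_in = [sys.executable, "-c", "import os; os.abort()"]
    print(reproduces_crash(stand_in))
```

In real use the command would be an engine shell invoked on the agent's proof-of-concept file. A production check would also need to confirm that the crash matches the intended bug rather than an unrelated failure.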

The headline results, as reported in the paper:

# Reported pass rates (higher = better), per the SEC-bench Pro paper
Open-weight baseline (Kimi-K2.6)      V8: 11.7%
Strongest single frontier config      V8: 32.0%   SpiderMonkey: 38.8%
ClaudeCode + Codex (two-agent union)  V8: 37.9%   SpiderMonkey: 48.8%

Two findings stand out. First, every single-agent configuration stays below 40% on each engine; only the two-agent union crosses that line, and only on SpiderMonkey. Frontier coding agents are far from reliable autonomous bug hunters on hard targets. Second, ClaudeCode and Codex solve complementary sets of instances: their union beats either alone (a relative lift of roughly 18% over the best single scaffold on V8, from 32.0% to 37.9%, and ~26% on SpiderMonkey, from 38.8% to 48.8%, per the authors). Different scaffolds find different bugs.
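The lift figures are relative gains over the best single configuration, not percentage-point differences. The short sketch below reproduces that arithmetic from the pass rates quoted above; the helper names `relative_lift` and `point_gain` are illustrative and do not come from the paper.

```python
# Pass rates (percent) as quoted above from the SEC-bench Pro paper.
BEST_SINGLE = {"V8": 32.0, "SpiderMonkey": 38.8}
TWO_AGENT_UNION = {"V8": 37.9, "SpiderMonkey": 48.8}


def relative_lift(single: float, union: float) -> float:
    """Relative gain of the union over the single configuration, in percent."""
    return (union - single) / single * 100.0


def point_gain(single: float, union: float) -> float:
    """Absolute gain in percentage points."""
    return union - single


if __name__ == "__main__":
    for engine in BEST_SINGLE:
        s, u = BEST_SINGLE[engine], TWO_AGENT_UNION[engine]
        print(f"{engine}: +{point_gain(s, u):.1f} points, "
              f"{relative_lift(s, u):.1f}% relative lift")
```

This gives about 5.9 points (18.4% relative) on V8 and 10.0 points (25.8% relative) on SpiderMonkey, matching the rounded figures attributed to the authors.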

Why it matters

This is a capability-measurement paper, not an attack, but the numbers cut both ways and both audiences should read them carefully.

For attackers, the result is sobering rather than alarming: a single off-the-shelf agent is not going to autonomously mine $1.5M-class V8 bugs today. The hype around “AI finds zero-days at scale” is, on this evidence, ahead of reality for the hardest targets — consistent with what we saw in AI-authored zero-day fingerprints and the exploit-capability ladder.

For defenders, the complementarity finding is the actionable part. If you are using coding agents for proactive vulnerability discovery, a single model is leaving bugs on the table; an ensemble of differently-scaffolded agents materially raises coverage. And the sub-40% single-agent ceiling is a reminder that AI bug-hunting augments human review; it does not replace it. The trajectory also matters: today’s ceiling is not tomorrow’s, and the same long-horizon competence that helps blue teams will help offense, which is why honest, fuzzing-harness-free benchmarks like this one are worth tracking. The broader difficulty of measuring these agents fairly is exactly the problem flagged in “benchmarking security agents is hard.”

Defenses

Concrete takeaways for teams adopting AI for security work:

  • Treat AI bug-hunting as augmentation, not autonomy. A below-40% pass rate on hard targets means humans must triage, confirm, and own the findings. Wire agent output into your existing review process, not around it.
  • Run ensembles, not a single model. Because ClaudeCode and Codex find complementary bugs, deploying multiple differently-scaffolded agents raises coverage more than upgrading any one of them. Diversity of scaffold beats single-model maximalism.
  • Benchmark on your own code, harness-free. SEC-bench Pro’s lesson is that hints and harnesses inflate scores. Evaluate vendors and internal tooling on realistic, hint-free tasks before trusting a “finds vulnerabilities automatically” claim.
  • Plan for the curve, not the snapshot. Build your detection, disclosure, and patch-prioritization assuming agent capability keeps rising — the defensive value of agentic red-teaming that compresses weeks into hours applies symmetrically to attackers.
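For teams benchmarking an ensemble on their own code, the measurement itself is simple set arithmetic over which tasks each agent solved. The sketch below assumes you record a set of solved task IDs per agent; the agent names, task IDs, and function names are hypothetical.

```python
from typing import Dict, Set


def coverage_report(solved: Dict[str, Set[str]], total: int) -> Dict[str, float]:
    """Pass rate (percent) per agent, plus the rate for the union of all agents.

    The union is stored under the key "union", so do not name an agent that.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    report = {agent: 100.0 * len(ids) / total for agent, ids in solved.items()}
    union = set().union(*solved.values())
    report["union"] = 100.0 * len(union) / total
    return report


def unique_finds(solved: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Tasks each agent solved that no other agent solved."""
    result = {}
    for agent, ids in solved.items():
        others = set().union(*(v for k, v in solved.items() if k != agent))
        result[agent] = ids - others
    return result


if __name__ == "__main__":
    # Hypothetical results on an internal 10-task benchmark.
    solved = {
        "agent_a": {"bug-01", "bug-02", "bug-03"},
        "agent_b": {"bug-03", "bug-04"},
    }
    print(coverage_report(solved, total=10))
    print(unique_finds(solved))
```

If `unique_finds` comes back non-empty for more than one agent, the agents are complementary on your code and the ensemble is earning its cost. If one agent's set contains the other's, the second agent adds nothing on that benchmark.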

Status

Item                Value
Publication         arXiv:2605.26548, 26 May 2026
Targets             V8 and SpiderMonkey (183 validated vulnerabilities)
Bug classes         memory-safety, sandbox, JIT, race-condition
Best single config  32.0% (V8) / 38.8% (SpiderMonkey)
Two-agent union     37.9% (V8) / 48.8% (SpiderMonkey)
Nature              Capability benchmark; no exploit released

This is published, peer-reviewable research with a public project page; it documents a capability ceiling, not an unpatched product flaw. The useful number to carry forward: on genuinely hard, harness-free targets, the best single agent today solves fewer than 40% of the bugs, and even the two-agent union stays under half. The way to do better is more diverse agents plus human review, not blind trust in one model.

Sources