DEFENSE MEDIUM NEW

SkillVetBench: an LLM-as-Judge that catches what skill scanners miss

A June 14, 2026 arXiv paper shows code-layer skill scanners miss 89–100% of instruction-layer threats, while an LLM-as-Judge flags all 78 malicious test skills with zero false positives.

2026-06-18 // 6 min // affects: llm-agents, agent-skills, skill-marketplaces, coding-agents

What is this?

SkillVetBench is a security-vetting system for AI-agent skills, published on arXiv on June 14, 2026 (arXiv:2606.15899) by Ismail Hossain, Sai Puppala, Md Jahangir Alam, Tanzim Ahad and Sajedul Talukder of the SUPREME Lab at the University of Texas at El Paso. It is the public-facing companion to a benchmark framework paper (arXiv:2606.00925) that supplies the labeled corpus behind its numbers.

A skill is the unit of extensibility for modern agents — a package that bundles natural-language instructions, scripts and tool/permission declarations (the SKILL.md family). Open marketplaces like ClawHub and OpenClaw now host tens of thousands of community-contributed skills. The paper’s central claim is uncomfortable: the scanners those marketplaces rely on are looking in the wrong layer. The danger in a malicious skill usually lives in its instructions, not in inspectable code — and that is exactly where code-layer tools are blind. This is the same supply-chain surface measured by MalSkillBench and exploited by semantic compliance hijacking.

How it works

SkillVetBench runs an LLM-as-Judge over the full skill artifact (instructions, code, config, tool declarations — parsed into typed segments) and emits three deliberately separate signals per skill, never fused into one number:

  • A three-tier verdict — Benign / Suspicious / Malicious — with a confidence score in [0,1].
  • A full CVSS v4.0 vector, where the judge assigns the categorical metric values (AV, PR, VC, etc.) and the standard scoring function computes the number deterministically.
  • SARS (Skill Agentic Risk Score), an original 0–10 metric built for instruction-following systems.
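One way to picture "three separate signals, never fused" is a record that stores them side by side and offers no combined score. The sketch below is illustrative only: the class and field names are assumptions, not the paper's schema.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    BENIGN = "Benign"
    SUSPICIOUS = "Suspicious"
    MALICIOUS = "Malicious"


@dataclass(frozen=True)
class SkillAssessment:
    """Hypothetical container: the three signals sit side by side, never merged."""

    verdict: Verdict
    confidence: float   # judge confidence in [0, 1]
    cvss_vector: str    # CVSS v4.0 vector whose metric values the judge assigned
    cvss_score: float   # computed from the vector by the standard scoring function
    sars: float         # Skill Agentic Risk Score, 0-10

    def __post_init__(self) -> None:
        # Range checks only; there is deliberately no method that blends the signals.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        if not 0.0 <= self.cvss_score <= 10.0:
            raise ValueError("cvss_score must be in [0, 10]")
        if not 0.0 <= self.sars <= 10.0:
            raise ValueError("sars must be in [0, 10]")
```

Keeping the record frozen and free of any "overall risk" property makes it harder for downstream code to quietly collapse the signals into one number.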

SARS scores five 0–3 dimensions and weights them:

SARS = (2·IFR + 1.5·DG + 1.5·AI + 2·BR + 2·CA) / 2.7   # 0–10

IFR  Instruction Fidelity Risk  — can user text override the skill's own instructions?
DG   Data Gravity               — public → internal → confidential → restricted secrets
AI   Action Irreversibility     — read-only → reversible → hard → DELETE/send/pay
BR   Blast Radius               — self → team → platform → cross-system / wormable
CA   Chain Amplification        — does it become a force multiplier chained with other skills?

IFR, BR and CA carry the heavier 2× weight precisely because they are the axes a code scanner cannot see — instruction hijackability, how far harm spreads, and whether a skill amplifies harm in a multi-skill chain. Labels: Critical ≥9.0, High 7.0–8.9, Medium 4.0–6.9, Low <4.0.
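The formula and labels above are simple enough to compute by hand. Here is a minimal Python sketch, assuming each dimension is an integer from 0 to 3. The function names are made up for illustration, and a score that falls between the printed bands (for example 8.95) is treated as the lower band, which is an assumption.

```python
# Weights as given in the SARS formula; they sum to 9, so the maximum weighted
# total is 3 * 9 = 27, and dividing by 2.7 maps it onto a 0-10 scale.
WEIGHTS = {"IFR": 2.0, "DG": 1.5, "AI": 1.5, "BR": 2.0, "CA": 2.0}


def sars_score(dims: dict[str, int]) -> float:
    """Compute SARS from five dimension scores, each an integer in 0..3."""
    for name in WEIGHTS:
        if dims[name] not in (0, 1, 2, 3):
            raise ValueError(f"{name} must be an integer from 0 to 3")
    weighted = sum(WEIGHTS[name] * dims[name] for name in WEIGHTS)
    return round(weighted / 2.7, 2)


def sars_label(score: float) -> str:
    """Map a SARS score onto the Critical / High / Medium / Low bands."""
    if score >= 9.0:
        return "Critical"
    if score >= 7.0:
        return "High"
    if score >= 4.0:
        return "Medium"
    return "Low"
```

With the hypothetical values IFR=3, DG=1, AI=1, BR=2, CA=2, the weighted total is 6 + 1.5 + 1.5 + 4 + 4 = 17, so the score is 17 / 2.7 ≈ 6.3, which lands in Medium.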

The point of keeping CVSS and SARS apart is that the gap between them is the signal. In the paper’s data, a “Data Exposure” skill scores only CVSS 1.84 and “Supply Chain” only 2.30 — both below the Medium band — yet each carries elevated SARS dimensions and earns a Suspicious verdict. The same artifact that looks harmless as code is dangerous as an agentic actor.
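A reviewer could surface that gap mechanically. The sketch below flags skills whose CVSS score sits below the Medium band while one or more SARS dimensions are elevated. The cutoff for "elevated" (2 or above on the 0–3 scale) is an assumption made for illustration, and the per-dimension values in any example are hypothetical, since the article does not reproduce them.

```python
CVSS_MEDIUM_FLOOR = 4.0  # the CVSS Medium band starts at 4.0
ELEVATED = 2             # assumption: a 0-3 dimension counts as elevated at 2 or above


def divergent_dimensions(cvss_score: float, sars_dims: dict[str, int]) -> list[str]:
    """Return the elevated SARS dimensions of a skill that CVSS rates below Medium.

    An empty list means there is no CVSS/SARS divergence to investigate.
    """
    if cvss_score >= CVSS_MEDIUM_FLOOR:
        return []
    return sorted(name for name, value in sars_dims.items() if value >= ELEVATED)
```

For a skill with CVSS 1.84 and hypothetical dimensions DG=3 and BR=2 (the rest at 0 or 1), the function returns `["BR", "DG"]`: low as code, but worth a closer look as an agentic actor.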

Crucially, the authors are honest about scope: the judge reads static content only. It does not execute the skill, sandbox-run code, or observe runtime arguments. Every verdict is an assessment of declared and inspectable behavior, not observed behavior.

Why it matters

The headline result is the measurement of detector blindness. On a controlled 100-skill set (78 confirmed-malicious, 22 benign), the LLM-as-Judge stage reached zero false negatives and zero false positives, while the best static baseline (SkillSieve) still missed 15%. For instruction-layer categories the gap is a chasm: conventional tools missed between 89% and 100% of threats, and CodeBERT detected none of nine memory-poisoning skills. If your skill-review pipeline is a signature matcher or a static analyzer, it is structurally incapable of seeing the most common agent-skill attacks — prompt injection and memory poisoning — no matter how well-tuned it is.
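The headline figures follow directly from the counts. As a quick check of the arithmetic, using only numbers quoted above (the helper name `rate` is illustrative):

```python
def rate(numerator: int, denominator: int) -> float:
    """Plain ratio, guarding against an empty denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


# Controlled set: 78 malicious and 22 benign skills, with no misses and no false alarms.
TP, FN, FP, TN = 78, 0, 0, 22

judge_recall = rate(TP, TP + FN)               # share of malicious skills flagged
judge_precision = rate(TP, TP + FP)            # share of flags that were truly malicious
judge_false_positive_rate = rate(FP, FP + TN)  # share of benign skills wrongly flagged

# CodeBERT on the memory-poisoning category: 0 of 9 skills detected.
codebert_miss_rate = rate(9 - 0, 9)
```

Recall and precision both come out at 1.0 and the false-positive rate at 0.0 for the judge, while the CodeBERT miss rate on memory poisoning is 1.0.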

The result that keeps this honest: detection rate swung from 35% to 95% across four different LLM evaluators. An LLM judge is not a silver bullet; a weak judge is its own blind spot. That variance is the paper’s argument for ensemble scoring rather than trusting any single model’s verdict.

Defenses

Vet skills at the instruction layer, not just the code layer. Treat every third-party skill as both executable code and an instruction that steers your agent. A clean static scan tells you almost nothing about prompt-injection or memory-poisoning risk. Add a semantic review stage that reads the natural-language directives, the declared tools, and the permissions together.

Score for agentic blast radius, not just CVSS. Borrow the SARS axes even if you don’t adopt the tool: for each skill, ask whether user text can override its instructions (IFR), how sensitive the data it touches is (DG), whether its actions are reversible (AI), how far harm propagates (BR), and whether it amplifies when chained (CA). A low CVSS score on a skill with high blast radius and chain amplification is a trap, not an all-clear.

Use an ensemble of judges, and record which model judged. Given the 35–95% spread across evaluators, route high-stakes skills through more than one model and treat disagreement as a signal to escalate to human review. Pin and log the evaluator identity so verdicts stay comparable over time.
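A minimal sketch of that routing rule, assuming each judge returns one of the three verdict strings. The class, function, and `"escalate-to-human"` marker are hypothetical names, and requiring unanimity is one possible policy rather than anything the paper prescribes.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeVerdict:
    model_id: str  # pinned evaluator identity, e.g. model name plus version
    verdict: str   # "Benign", "Suspicious" or "Malicious"


def combine(verdicts: list[JudgeVerdict]) -> dict:
    """Accept a verdict only when every judge agrees; otherwise escalate.

    The per-judge verdicts are always returned so they can be logged.
    """
    if len(verdicts) < 2:
        raise ValueError("an ensemble needs at least two judges")
    counts = Counter(v.verdict for v in verdicts)
    unanimous = len(counts) == 1
    return {
        "decision": next(iter(counts)) if unanimous else "escalate-to-human",
        "judges": {v.model_id: v.verdict for v in verdicts},
    }
```

Returning the per-model map alongside the decision keeps the evaluator identity attached to every stored verdict, which is what makes results comparable after a model is swapped.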

Keep a human gate for High/Critical and don’t trust static-only verdicts. Because the judge reads static content only, a determined attacker can still hide behavior behind runtime conditions. Pair semantic vetting with runtime controls — deny-by-default permissions and runtime monitoring as in SkillGuard — so a skill that slips past review still can’t act outside its declared scope.
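Deny-by-default can be as small as an allowlist check in front of every privileged action. The sketch below is a generic illustration with made-up permission strings; it is not SkillGuard's API.

```python
class ScopeGuard:
    """Deny-by-default gate: only permissions the skill declared up front are allowed."""

    def __init__(self, declared_permissions):
        self._declared = frozenset(declared_permissions)
        self.denied_log: list[str] = []  # kept for runtime monitoring and review

    def check(self, permission: str) -> None:
        """Raise PermissionError unless the permission was declared by the skill."""
        if permission not in self._declared:
            self.denied_log.append(permission)
            raise PermissionError(f"undeclared permission requested: {permission}")
```

A skill that declared only `"fs.read"` passes `check("fs.read")`, while `check("net.send")` raises and leaves an entry in `denied_log`, so behavior hidden behind a runtime condition still hits the gate when it fires.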

Don’t trust a marketplace’s “approved” badge. SkillVetBench’s ClawHub dual-view exists precisely because official marketplace verdicts and an independent review frequently disagree. Re-vet skills yourself before installation, and re-vet on every version bump.
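One simple way to enforce "re-vet on every version bump" is to key verdicts on a digest of the skill's content rather than on its name or version label. This is a sketch under that assumption; the function names and the in-memory set of vetted digests are illustrative.

```python
import hashlib


def skill_digest(files: dict[str, bytes]) -> str:
    """SHA-256 over every file path and its content, in sorted path order.

    Each field is length-prefixed so different file layouts cannot
    produce the same byte stream.
    """
    h = hashlib.sha256()
    for path in sorted(files):
        name = path.encode("utf-8")
        h.update(len(name).to_bytes(8, "big"))
        h.update(name)
        h.update(len(files[path]).to_bytes(8, "big"))
        h.update(files[path])
    return h.hexdigest()


def needs_revet(files: dict[str, bytes], vetted_digests: set[str]) -> bool:
    """True when this exact content has not been vetted before."""
    return skill_digest(files) not in vetted_digests
```

Because the digest covers the instructions as well as the scripts, a one-line change to SKILL.md invalidates the earlier verdict just as a code change would.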

Status

Item                   Reference                           Date        Notes
Paper (leaderboard)    arXiv:2606.15899                    2026-06-14  LLM-as-Judge, SARS + CVSS v4.0 + verdict
Companion benchmark    arXiv:2606.00925                    2026-06     Labeled corpus / detection results
Controlled detection   78 malicious + 22 benign            -           LLM-as-Judge: 0 FN, 0 FP
Best static baseline   SkillSieve                          -           Still misses 15%
Instruction-layer gap  Prompt injection, memory poisoning  -           Conventional tools miss 89–100%
Evaluator variance     4 LLM judges                        -           Detection 35–95% → ensemble advised
Scope limit            Static analysis only                -           No execution / runtime observation

The takeaway is not “use this leaderboard.” It is that skill vetting has to move to the instruction layer, and that no single judge, human or model, should be the only gate in front of a skill your agent will load and follow.

Sources