SUPPLY CHAIN MEDIUM NEW

MalSkillBench: we can't measure malicious-skill detectors because the test data is biased

A June 2026 paper builds the first runtime-verified benchmark of malicious agent skills — 3,944 samples across 108 attack cells — and shows a single detector's recall can swing 66 points depending on which dataset you test it on.

2026-06-15 // 7 min // affects: claude-code, gemini-cli, opencode, cursor, agent-skills

What is this?

In June 2026, researchers released MalSkillBench (arXiv:2606.07131), described as the first runtime-verified benchmark of malicious agent skills. The paper’s contribution is not a new attack — it is a measurement result that should worry anyone relying on a skill scanner: we currently cannot tell which detectors actually work, because the data they are tested on is small, narrow, and never verified against real behavior.

A skill is the unit of extensibility for AI coding agents like Claude Code, Gemini CLI, OpenCode, and Cursor. It is a package built around a markdown file (SKILL.md) that bundles natural-language instructions, executable scripts, and tool permissions. Because a skill is at once executable code and agent-facing instruction, it is a software supply-chain dependency whose risk is “neither pure code nor pure prompt.” The ecosystem has scaled fast: the paper cites marketplaces hosting over 1.6 million skills and one registry listing 64,000+, with daily submissions on major platforms surging from under 50 to over 500 within weeks.

How it works

The core problem MalSkillBench documents is evaluation bias, and it comes from three gaps. First, there is no public ground truth at scale — industry reports publish counts but withhold the samples, and the largest prior public academic dataset totals only 157 samples. Second, the wild data that does exist covers a narrow slice of the attack surface: among 703 wild malicious skills the authors collected, 86.3% are dependency-impersonation attacks (a skill declares a trusted-looking prerequisite the agent is induced to install), while prompt-injection attacks are nearly absent. Third, there is no shared evaluation methodology — every detector is measured on its own private dataset under its own metrics.

To fix this, the authors build a closed-loop Generate-Verify-Feedback pipeline. Each candidate skill is loaded by a real coding agent inside a Docker sandbox instrumented with strace and inotifywait, with an LLM judge alongside those monitors; only samples whose observed behavior matches their declared ground-truth indicator are admitted. Verification thus turns a label claim into behavioral evidence. Coverage is organized by a three-dimensional taxonomy (attack vector × malicious behavior × insertion strategy) that yields 108 cells. The table lists the attack vectors and the insertion strategies available to each; the behavior dimension is not enumerated in it:

Dimension          Values (defines the 108-cell attack space)
-----------------  ------------------------------------------------------
Attack vector      CI  (Code Injection — payload in scripts/inline code)
                   PI  (Prompt Injection — adversarial markdown instructions)
                   MIXED (chain split across markdown + script)
Insertion (CI)     New Script File, Function Append, Function Inject,
                   Inline Code Block
Insertion (PI)     Full Camouflage, Partial Injection, Steganographic
                   (HTML comments, zero-width chars, homoglyphs)
Insertion (MIXED)  Download+Execute, Config+Load, Fetch+Run

No payloads are reproduced here, and none are needed to grasp the point: the same malicious routine (e.g. a reverse shell) can move between a Python package’s setup.py and a skill’s scripts/calculate.py with only the carrier and trigger changed. The MIXED case is the most evasive — the markdown stages an artifact and a separate script consumes it, so neither layer looks fully malicious on its own.
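The admission rule in the Generate-Verify-Feedback description (keep a sample only if its observed behavior matches its declared indicator) can be sketched in a few lines. This is a minimal illustration and not the paper's harness: the behavior labels, the regular expressions, and the `admit` function are hypothetical, and the trace lines are hand-written in the general shape of strace output rather than captured from a real run.

```python
import re

# Hypothetical mapping from a behavior label to a pattern over trace lines.
# Labels and patterns are illustrative assumptions, not the paper's indicators.
BEHAVIOR_PATTERNS = {
    "network_egress": re.compile(r"^connect\(.*sin_port=htons\(\d+\)"),
    "credential_read": re.compile(r"^openat\(.*\.(ssh|aws)/"),
    "persistence": re.compile(r"^openat\(.*(\.bashrc|cron)"),
}


def observed_behaviors(trace_lines):
    """Return the set of behavior labels whose pattern matches any trace line."""
    found = set()
    for line in trace_lines:
        for label, pattern in BEHAVIOR_PATTERNS.items():
            if pattern.search(line):
                found.add(label)
    return found


def admit(declared_indicator, trace_lines):
    """Admit a candidate only if its declared behavior was actually observed."""
    return declared_indicator in observed_behaviors(trace_lines)


# Hand-written lines in the general shape of strace output (not real captures).
SAMPLE_TRACE = [
    'openat(AT_FDCWD, "/home/user/.ssh/id_rsa", O_RDONLY) = 3',
    'connect(4, {sa_family=AF_INET, sin_port=htons(443), '
    'sin_addr=inet_addr("203.0.113.7")}, 16) = 0',
]

if __name__ == "__main__":
    print(admit("credential_read", SAMPLE_TRACE))  # True
    print(admit("persistence", SAMPLE_TRACE))      # False
```

A sample labeled `persistence` is rejected here because nothing in the trace supports the label, which is the sense in which verification replaces a label claim with behavioral evidence.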

Why it matters

The headline finding is about trust in tooling, not a single CVE. When the authors evaluate 12 detection tools on a fair, unified harness, a single detector’s recall swings by 66 percentage points depending on which subset of skills it is measured against, enough to flip an off-the-shelf antivirus aggregator from near the bottom of the field to the top. In other words, a vendor benchmark run on wild-only data (overwhelmingly dependency-impersonation) can make a detector look excellent while it remains blind to the prompt-injection and mixed-vector skills that so far barely appear in the wild.
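A short worked example shows how test-set composition alone can move a recall figure even when the detector itself does not change. Every number below is invented for illustration: the per-category recall values, the category names, and the sample counts are assumptions, not results from the paper.

```python
def overall_recall(per_category_recall, mix):
    """Weighted recall: each category's recall weighted by its share of the test set."""
    total = sum(mix.values())
    return sum(per_category_recall[c] * n for c, n in mix.items()) / total


# Hypothetical detector: strong on dependency impersonation, weak elsewhere.
RECALL = {"dependency_impersonation": 0.95, "prompt_injection": 0.10, "mixed": 0.20}

# Hypothetical malicious-sample counts per category in two test sets.
WILD_SKEWED = {"dependency_impersonation": 860, "prompt_injection": 40, "mixed": 100}
BALANCED = {"dependency_impersonation": 300, "prompt_injection": 400, "mixed": 300}

if __name__ == "__main__":
    print(f"{overall_recall(RECALL, WILD_SKEWED):.3f}")  # 0.841
    print(f"{overall_recall(RECALL, BALANCED):.3f}")     # 0.385
```

The same hypothetical detector reports 84.1% recall on the skewed set and 38.5% on the balanced one. This is why a headline recall number is hard to interpret without the category breakdown behind it.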

That gap matters because the attack surface is genuinely hybrid. A skill can carry a clean-looking markdown body whose only job is to make a downstream script do something harmful — data exfiltration, credential harvesting, or persistence. Steganographic insertion (HTML comments, zero-width characters, homoglyphs) defeats human review of the SKILL.md, and the executable half defeats prompt-only scanners. The paper is research on a benchmark and a measurement methodology, not a report of an in-the-wild breach — but it lands against a backdrop of documented skill-marketplace poisoning and a supply chain that is already at internet scale.
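The three steganographic carriers named above are easy to miss by eye but simple to surface mechanically. The sketch below is a rough reviewer aid under stated assumptions, not a detector from the paper: the function name, the zero-width character set, and the homoglyph heuristic (flag any non-ASCII letter whose Unicode name does not start with LATIN) are all choices made for illustration, and the heuristic will also flag legitimate non-English text.

```python
import re
import unicodedata

# A small, non-exhaustive set of invisible code points (assumption).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def hidden_text_findings(markdown_text):
    """Return (kind, detail) tuples for content a human reviewer may not see."""
    findings = []
    # HTML comments are not shown when the markdown is rendered.
    for match in HTML_COMMENT.finditer(markdown_text):
        findings.append(("html_comment", match.group(0)))
    for index, char in enumerate(markdown_text):
        if char in ZERO_WIDTH:
            findings.append(("zero_width", f"U+{ord(char):04X} at offset {index}"))
        elif char.isalpha() and ord(char) > 127:
            # Crude homoglyph candidate: a letter from a non-Latin script.
            name = unicodedata.name(char, "UNKNOWN")
            if not name.startswith("LATIN"):
                findings.append(("non_latin_letter", f"{name} at offset {index}"))
    return findings


if __name__ == "__main__":
    # The second "a" in "calculator" is Cyrillic U+0430; a zero-width space follows the period.
    sample = "Run the c\u0430lculator.\u200b<!-- hidden note -->"
    for kind, detail in hidden_text_findings(sample):
        print(kind, detail)
```

A scan like this only addresses the markdown half of a skill. It says nothing about what the bundled scripts do, which is the point of the hybrid-surface argument.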

Defenses

MalSkillBench is itself a defensive contribution — better measurement is a prerequisite for better detection. The transferable mitigations:

  1. Treat third-party skills as untrusted software dependencies, not configuration. A SKILL.md ships executable scripts and tool permissions. Apply the same review, pinning, and provenance controls you would to an npm package or IDE extension — and remember that a clean-looking markdown body can still drive a malicious script.
  2. Don’t trust a detector’s marketing numbers; ask what it was measured on. If a scanner reports high recall, ask whether it was tested across both code-injection and prompt-injection vectors and across mixed-vector skills — not just the dependency-impersonation pattern that dominates wild datasets. Recall on a skewed set says little about real coverage.
  3. Run skills under runtime monitoring, not static scanning alone. The benchmark’s own ground truth comes from sandboxed execution with system-call monitoring. Defenders can borrow the idea: execute or stage new skills in an instrumented sandbox and watch for suspicious behaviors (network egress, credential reads, persistence) before granting them production tool access.
  4. Layer controls so no single check is load-bearing. Static scanners miss runtime behavior; prompt scanners miss executable payloads; human review misses steganographic text. Combine least-privilege tool permissions, egress gating, and behavioral monitoring so a skill that slips one layer is caught by another.
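The pinning control in item 1 can be sketched as a content digest over the whole skill directory, so that a change to the SKILL.md or to any bundled script invalidates an earlier review. This is one possible approach under stated assumptions, not a mechanism described in the paper or provided by any particular agent; `skill_digest` and `matches_pin` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def skill_digest(skill_dir):
    """SHA-256 over every file's relative path and bytes, in sorted path order."""
    root = Path(skill_dir)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        relative = path.relative_to(root).as_posix().encode("utf-8")
        data = path.read_bytes()
        # Length-prefix each field so bytes cannot move between files unnoticed.
        digest.update(len(relative).to_bytes(8, "big") + relative)
        digest.update(len(data).to_bytes(8, "big") + data)
    return digest.hexdigest()


def matches_pin(skill_dir, pinned_hex):
    """True only if the directory still hashes to the reviewed, pinned value."""
    return skill_digest(skill_dir) == pinned_hex
```

In use, a reviewer would record `skill_digest(path)` at approval time and have the loader refuse any skill for which `matches_pin` returns False. The digest covers scripts as well as markdown because either half can carry the payload.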

Status

Item                 Reference         Date     Notes
-------------------  ----------------  -------  ------------------------------------------------------
MalSkillBench paper  arXiv:2606.07131  2026-06  First runtime-verified benchmark of malicious agent skills
Dataset scale        MalSkillBench     2026-06  3,944 malicious skills (703 wild + 3,214 generated + 27 curated) + 4,000 benign
Taxonomy             MalSkillBench     2026-06  Attack vector × behavior × insertion = 108 cells
Key measurement      MalSkillBench     2026-06  A single detector’s recall swings 66 points across subsets; 12 tools evaluated
Wild-data bias       MalSkillBench     2026-06  86.3% of 703 wild samples are dependency-impersonation; PI nearly absent

The durable lesson is methodological: in a fast-moving supply chain, what a detector catches on yesterday’s wild samples tells you little about what it will catch tomorrow. Measure detectors against the full attack space — verified by behavior, not by labels — and treat agent skills with the same suspicion you already apply to any third-party dependency.

Sources