SkillAttack: automated red-teaming finds exploits in agent skills
An April 2026 paper, SkillAttack, reframes exploit discovery as a path-search problem and shows that even well-intentioned agent skills are exploitable: attack success reaches 0.93 on adversarial skills and up to 0.26 on real-world skills.
What is this?
Published in April 2026, SkillAttack: Automated Red Teaming of Agent Skills through Attack Path Refinement (arXiv:2604.04989) targets a fast-growing piece of the agent stack: skills. A skill is a reusable bundle of natural-language instructions plus optional code that an agent loads to perform a task — increasingly distributed through marketplaces and registries, and increasingly trusted by the agents that install them.
SkillAttack is a red-teaming framework, not a new attack. It probes whether a skill can be coerced into unsafe behavior using only adversarial prompting — it does not modify the target skill’s code or instructions. The framework’s premise is that a skill’s published surface (its declared instructions and tool calls) already encodes one or more reachable paths to an unsafe action, and that an attacker need only find the user input that walks the agent down that path. This is a structured cousin of the skill-supply-chain risks already documented in malicious agent skills and skill-registry supply chains, but the contribution here is the search method.
How it works
The key idea is to formulate exploit discovery as a path-search problem. Each “attack path” describes an expected execution flow — from a user prompt, through the skill’s logic, to an unsafe behavior — and the framework iteratively refines candidate paths until one converges on a working exploit. It runs in three stages:
- Skill vulnerability analysis — audit the skill’s code and instructions to enumerate attack surfaces.
- Surface-parallel attack generation — for each surface, infer an attack path and construct a corresponding adversarial prompt, in parallel across surfaces.
- Feedback-driven exploit refinement — execute the prompt against the agent, collect execution feedback, and iteratively refine both the path and the prompt, forming a closed loop that progressively converges toward success.
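The control flow of stages two and three can be sketched as a small search loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `AttackPath`, `Feedback`, `red_team_skill`, and the `propose`, `execute`, and `refine` callables are hypothetical names, the surfaces are assumed to come from stage one, and no prompt content is modeled.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class AttackPath:
    """Expected execution flow for one attack surface (hypothetical structure)."""
    surface: str
    steps: List[str]   # e.g. user prompt -> skill logic -> unsafe behavior
    prompt: str        # candidate input for this path; contents not modeled here


@dataclass
class Feedback:
    """What the sandboxed agent did when given the prompt (hypothetical)."""
    success: bool
    trace: str


def red_team_skill(
    surfaces: List[str],
    propose: Callable[[str], AttackPath],
    execute: Callable[[AttackPath], Feedback],
    refine: Callable[[AttackPath, Feedback], AttackPath],
    max_rounds: int = 5,
) -> Optional[AttackPath]:
    """Return the first path that converges, or None if the budget runs out."""
    # Stage 2: one candidate path per surface. Sequential here for clarity;
    # the framework is described as running this in parallel across surfaces.
    candidates = [propose(surface) for surface in surfaces]
    for _ in range(max_rounds):
        next_round = []
        for path in candidates:
            feedback = execute(path)      # Stage 3: run against the agent in a sandbox
            if feedback.success:
                return path
            # A failed attempt is kept as signal: refine both path and prompt.
            next_round.append(refine(path, feedback))
        candidates = next_round
    return None
```

Passing the three stages in as callables keeps the loop itself free of any model or payload detail, which is also how a defender could wire the same structure into a pre-publication test harness.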
No payload is reproduced here. The lesson is the shape of the technique — treat a skill’s declared surface as a map of reachable bad states, then let execution feedback guide a search toward them — which is enough to understand the risk without a working exploit. The closed-loop refinement is what separates this from one-shot adversarial prompting: failures become signal, not dead ends, echoing the broader shift toward agentic red-teaming measured in hours, not weeks.
Why it matters
The authors evaluate SkillAttack across 10 LLMs on 71 adversarial skills and 100 real-world skills. Reported attack success rate (ASR) reaches 0.73–0.93 on adversarial skills and up to 0.26 on real-world skills.
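For readers unfamiliar with the metric, ASR is a simple ratio. The sketch below assumes ASR is the fraction of evaluated cases in which the unsafe behavior was triggered; the paper's exact counting unit (per skill, per trial, or per model) is not stated here, and the tally in the example is made up purely to show the arithmetic.

```python
from typing import Iterable


def attack_success_rate(outcomes: Iterable[bool]) -> float:
    """Fraction of evaluated cases that ended in the unsafe behavior."""
    results = list(outcomes)
    if not results:
        raise ValueError("no cases recorded")
    return sum(results) / len(results)


# Hypothetical tally (not data from the paper): 26 successes in 100 cases.
example_outcomes = [True] * 26 + [False] * 74
example_asr = attack_success_rate(example_outcomes)
```

Under that assumption, an ASR of 0.26 over 100 real-world skills would correspond to roughly one skill in four yielding a working exploit.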
Two takeaways. First, the gap between adversarial and real-world numbers is expected, but an ASR of up to 0.26 on skills that were never designed to be malicious is the uncomfortable result: well-intentioned skills still expose reachable unsafe paths under realistic interaction. Second, because the method needs only prompting and a read of the skill’s published code and instructions, with no modification of the skill itself, it models a realistic attacker who installs a popular skill and then drives it from the user turn. Any team shipping or consuming skills through a marketplace inherits this surface, and it compounds with the lethal trifecta: a skill that touches private data, ingests untrusted content, and can reach the network is a high-value target for exactly this kind of guided search.
Defenses
Treat skills as untrusted, privilege-bearing code, and test them the way attackers will.
- Red-team skills before publishing or installing. Run automated path-search red-teaming (the SkillAttack repo is open source) against a skill in a sandbox, and gate promotion on the results. Don’t rely on manual review alone, which can miss reachable paths.
- Least privilege per skill. Scope each skill’s tools, data, and network egress to the minimum it needs, in line with agent skill permissions. A narrow capability set shrinks the set of reachable unsafe states the search can find.
- Vet at admission, not just at authorship. Use independent vetting of declared instructions and code — see LLM-judge skill vetting and capability governance for verified skills — and re-vet on every version bump.
- Constrain the execution path. Require confirmation or human review before a skill performs irreversible or high-impact actions, so a converged exploit still hits a checkpoint before causing harm.
- Monitor at runtime. Log skill tool calls and watch for the iterative, feedback-seeking probing pattern that closed-loop refinement produces against a live agent.
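Two of these controls, least privilege and the confirmation checkpoint, can be combined in a single deny-by-default check in front of every tool call. The following is a minimal sketch, assuming a host runtime that can intercept tool calls; `SkillPolicy`, `ToolCall`, `authorize`, and all field, tool, and host names are hypothetical and do not correspond to any particular agent framework.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class SkillPolicy:
    """Hypothetical per-skill policy; field names are illustrative."""
    allowed_tools: FrozenSet[str]
    allowed_hosts: FrozenSet[str] = frozenset()
    needs_confirmation: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class ToolCall:
    tool: str
    host: str = ""  # empty when the call makes no network request


def authorize(
    call: ToolCall,
    policy: SkillPolicy,
    confirm: Callable[[ToolCall], bool],
) -> bool:
    """Deny by default; ask a human before high-impact tools run."""
    # Least privilege: anything not explicitly granted is refused.
    if call.tool not in policy.allowed_tools:
        return False
    # Network egress is limited to hosts the skill was scoped to.
    if call.host and call.host not in policy.allowed_hosts:
        return False
    # Checkpoint: irreversible or high-impact tools need explicit approval.
    if call.tool in policy.needs_confirmation:
        return confirm(call)
    return True


# Illustrative policy: the skill may read files and send email via one host,
# but sending email always requires confirmation.
policy = SkillPolicy(
    allowed_tools=frozenset({"read_file", "send_email"}),
    allowed_hosts=frozenset({"mail.example.com"}),
    needs_confirmation=frozenset({"send_email"}),
)
```

With a policy like this, a converged exploit that steers the agent toward `send_email` still stops at the `confirm` callback, and a call to any tool or host outside the grant is refused outright.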
Status
| Item | Detail |
|---|---|
| Disclosure | arXiv:2604.04989, April 2026 |
| Type | Red-teaming framework for agent skills (prompt-only, no skill modification) |
| Method | Exploit discovery as path search + closed-loop refinement |
| Evaluation | 10 LLMs; 71 adversarial skills; 100 real-world skills |
| Result | ASR 0.73–0.93 (adversarial), up to 0.26 (real-world) |
| Code | Open-source implementation released |