Off-the-shelf LLM agents fail at SAST scanning, empirical test finds
A June 10, 2026 study pitted a local LLM agent against the Bandit SAST tool on 101,816 lines of Python. Every model scored a negative composite, dominated by hallucinated findings.
What is this?
On June 10, 2026, Derek Yohn, Luke Flancher, Mirajul Islam and Khaled Slhoub of the Florida Institute of Technology published “Can Open-Source LLM Agents Replace Static Application Security Testing Tools? An Empirical Assessment” (arXiv:2606.11672, cs.CR). The question matters because teams shipping AI-generated code increasingly want an AI agent to review it too, and many small teams never wired a real Static Application Security Testing (SAST) tool into their pipeline in the first place.
The authors built a simple Python agent called Snitch, powered in turn by three Ollama-hosted open-weight models — Google’s gemma3, Meta’s llama3.1, and Alibaba’s qwen2.5 — and benchmarked it against Bandit, a vetted, rule-based Python SAST tool, on three real open-source repositories. Their conclusion is blunt: a general-purpose open-source LLM agent is not currently suitable for SAST scanning under realistic conditions.
How it works
The setup is a clean head-to-head. Bandit and Snitch each scanned 224 Python files / 101,816 lines of code across Beaverhabits (a habit tracker), Fail2ban (a security daemon), and Yum (the mature RPM package manager). Bandit’s medium/high-severity findings were treated as the reference “source of truth” — 4 in Beaverhabits, 37 in Fail2ban, 0 in the mature Yum codebase. Snitch ran a fixed expert-SAST prompt asking each model to emit JSON with where, what, why, and fix fields.
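The article does not reproduce Snitch's prompt or its exact output schema, so the following is only a minimal sketch of how model output carrying the four named fields might be validated before analysis. The `Finding` class, the `parse_findings` helper, and the assumption that every field is a plain string are illustrative and not taken from the paper.

```python
import json
from dataclasses import dataclass

# Field names come from the article; treating each as a string is an assumption.
REQUIRED_FIELDS = ("where", "what", "why", "fix")


@dataclass
class Finding:
    where: str  # claimed file and line range
    what: str   # short description of the issue
    why: str    # why the model thinks it is a vulnerability
    fix: str    # suggested remediation


def parse_findings(raw: str) -> list[Finding]:
    """Parse raw model output into findings, dropping malformed entries."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []  # the model did not emit valid JSON at all
    if isinstance(data, dict):
        data = [data]  # tolerate a single object instead of a list
    if not isinstance(data, list):
        return []
    findings = []
    for item in data:
        # Keep only objects in which every required field is a string.
        if isinstance(item, dict) and all(
            isinstance(item.get(key), str) for key in REQUIRED_FIELDS
        ):
            findings.append(Finding(**{key: item[key] for key in REQUIRED_FIELDS}))
    return findings
```

Validating the shape of the output says nothing about whether its contents are true; it only separates parse failures from findings that still need checking against the source.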
Performance was scored as recall (fraction of Bandit’s M/H findings the model also caught, with partial credit) minus a false-positive rate, giving a composite from -1 to 1. Every cell came out negative:
| Model | Beaverhabits (R / FP / Comp) | Fail2ban (R / FP / Comp) | Yum (R / FP / Comp) |
|---|---|---|---|
| gemma3 | 0.25 / 0.88 / -0.62 | 0.00 / 0.80 / -0.80 | 0.00 / 0.89 / -0.89 |
| llama3.1 | 0.25 / 0.46 / -0.21 | 0.03 / 0.40 / -0.37 | 0.00 / 0.59 / -0.59 |
| qwen2.5 | 0.25 / 0.28 / -0.03 | 0.01 / 0.14 / -0.12 | 0.00 / 0.40 / -0.40 |
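The composite described above can be sketched in a few lines. This is a hedged illustration, not the authors' scoring code: the article does not give the exact false-positive formula, so the sketch assumes it is the share of a model's reported findings that match no reference finding, and that partial credit is a value between 0 and 1 per reference finding. The function name and the example counts are hypothetical.

```python
def composite_score(
    match_credit: list[float], n_reported: int, n_false: int
) -> tuple[float, float, float]:
    """Return (recall, false-positive rate, composite).

    match_credit: one value in [0, 1] per reference finding
                  (1.0 = fully matched, 0.0 = missed, in between = partial credit).
    n_reported:   total findings the model emitted.
    n_false:      findings that matched no reference finding.
    The false-positive definition here is an assumption, not the paper's formula.
    """
    # With no reference findings there is nothing to recall, so recall is 0.
    recall = sum(match_credit) / len(match_credit) if match_credit else 0.0
    fp_rate = n_false / n_reported if n_reported else 0.0
    return recall, fp_rate, recall - fp_rate


# Hypothetical counts: 4 reference findings, 1 fully matched,
# 25 findings reported, 7 of them false.
recall, fp_rate, comp = composite_score([1.0, 0.0, 0.0, 0.0], n_reported=25, n_false=7)
print(round(recall, 2), round(fp_rate, 2), round(comp, 2))  # 0.25 0.28 -0.03
```

Under this reading, a repository with no reference findings can never score above zero, because recall stays at 0 while any reported finding adds to the false-positive rate.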
The dominant failure mode was hallucination. The where field — the file and line range of each issue — was frequently fabricated, pointing at non-existent files or locations. Only the separately-captured filename let the researchers analyze the output at all. The models also misread normal engineering patterns as vulnerabilities: a .env / dotenv setup flagged as hardcoded production secrets, standard-library constants like calendar.MONDAY flagged as secrets, license headers and Python dunder methods flagged as injection. qwen2.5 was the least bad (fewest false positives); gemma3 the worst. The authors note that closed or specialist models with better prompting might score higher — but the magnitude of the gap suggests the problem is structural, not a prompt-tuning artifact.
Why it matters
This is the defensive mirror image of recent agentic-vulnerability work like Code-Augur and long-horizon bug-hunting benchmarks: when you strip away the scaffolding and just point an off-the-shelf model at a codebase, it underperforms a 1990s-style pattern matcher. The danger isn’t that the agent finds nothing; it’s that it produces fluent, confident output that is mostly noise, encouraging developers to “fix” non-issues while real findings get buried. That false confidence is its own risk surface, and it compounds the trust problems already documented around coding-LLM usability and AI-introduced code defects. Speed makes it worse: Bandit cleared each repo in under a minute, while Snitch took one to four hours per repo (~672 LLM calls, a figure consistent with one call per file for each of the three models across 224 files, and 10-12 hours in total).
Defenses
For teams weighing an LLM agent as a security scanner, the practical takeaways:
- Keep the deterministic tool as the source of truth. Rule-based SAST (Bandit, Semgrep, CodeQL) is fast, reproducible, explainable, and CWE-linked. Treat the LLM as an optional second pass, never the gate.
- Never trust an LLM’s line numbers. The where field hallucinated routinely; any agent finding must be re-anchored to the actual source before a human acts on it.
- Expect a flood of false positives. Tune for the .env / dotenv, constants-as-secrets, and comment/license patterns the study saw, or you will train developers to ignore the tool entirely.
- Score before you adopt. Reproduce the paper’s cheap metric (recall and false-positive rate against a vetted baseline on your code) rather than trusting a vendor demo.
- Use LLMs where they’re actually strong. Notably, the researchers used Claude to generate the cross-referencing script that parsed the noisy output — LLM-as-assistant to a deterministic pipeline, not LLM-as-scanner.
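Re-anchoring a finding to the real source can start with a cheap existence check. The sketch below is one possible approach, not something the paper describes: it assumes the where field looks like `path:start-end` (or `path:line`), which is a hypothetical format, and the `location_is_real` helper is an illustrative name. It rejects findings whose file is missing, lies outside the repository, or has fewer lines than the claimed range.

```python
import re
from pathlib import Path

# Assumed format for the "where" field: "relative/path.py:12-18" or "relative/path.py:12".
WHERE_PATTERN = re.compile(r"^(?P<path>[^:]+):(?P<start>\d+)(?:-(?P<end>\d+))?$")


def location_is_real(repo_root: Path, where: str) -> bool:
    """Return True only if the claimed file and line range exist in the repo."""
    match = WHERE_PATTERN.match(where.strip())
    if match is None:
        return False  # unparseable location
    root = repo_root.resolve()
    target = (root / match["path"]).resolve()
    # Reject paths that escape the repository or do not point at a file.
    if not target.is_relative_to(root) or not target.is_file():
        return False
    start = int(match["start"])
    end = int(match["end"] or start)
    n_lines = len(target.read_text(encoding="utf-8", errors="replace").splitlines())
    # Line numbers are 1-based and must fall inside the file.
    return 1 <= start <= end <= n_lines
```

A finding that passes this check is only plausible, not correct; it still needs a human or a deterministic rule to confirm that the code at that location has the claimed problem.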
Status
| Item | Reference | Date | Notes |
|---|---|---|---|
| Empirical assessment | arXiv:2606.11672 | 2026-06-10 | Snitch agent vs Bandit, 224 files / 101,816 LOC |
| Models tested | gemma3, llama3.1, qwen2.5 (via Ollama) | 2026-06-10 | All general-purpose open-weight models |
| Best composite | qwen2.5 | 2026-06-10 | Still negative on all three repos |
| Baseline tool | Bandit (PyCQA) | — | Rule-based Python SAST, CWE-linked |
The takeaway: in mid-2026, a general-purpose open-source LLM agent is a poor substitute for a deterministic SAST tool — high false positives, fabricated locations, and hours of runtime. The useful pattern is the inverse of the hype: let proven tooling do the detection, and let the model assist with triage and explanation under human review.