RESEARCH LOW NEW

Off-the-shelf LLM agents fail at SAST scanning, empirical test finds

A June 10, 2026 study pitted a local LLM agent against the Bandit SAST tool on 101,816 lines of Python. Every model scored a negative composite, dominated by hallucinated findings.

2026-06-22 // 6 min affects: llama-3.1, gemma-3, qwen-2.5, ollama, llm-agents

What is this?

On June 10, 2026, Derek Yohn, Luke Flancher, Mirajul Islam and Khaled Slhoub of the Florida Institute of Technology published “Can Open-Source LLM Agents Replace Static Application Security Testing Tools? An Empirical Assessment” (arXiv:2606.11672, cs.CR). The question matters because teams shipping AI-generated code increasingly want an AI agent to review it too, and many small teams never wired a real Static Application Security Testing (SAST) tool into their pipeline in the first place.

The authors built a simple Python agent called Snitch, powered in turn by three Ollama-hosted open-weight models — Google’s gemma3, Meta’s llama3.1, and Alibaba’s qwen2.5 — and benchmarked it against Bandit, a vetted, rule-based Python SAST tool, on three real open-source repositories. Their conclusion is blunt: a general-purpose open-source LLM agent is not currently suitable for SAST scanning under realistic conditions.

How it works

The setup is a clean head-to-head. Bandit and Snitch each scanned 224 Python files (101,816 lines of code) across Beaverhabits (a habit tracker), Fail2ban (a security daemon), and Yum (the mature RPM package manager). Bandit’s medium/high-severity findings were treated as the reference “source of truth”: 4 in Beaverhabits, 37 in Fail2ban, and 0 in Yum. Snitch ran a fixed expert-SAST prompt asking each model to emit JSON with where, what, why, and fix fields.
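
As a rough illustration of how such a reference set could be built, the sketch below filters a parsed Bandit JSON report (as produced by something like bandit -r <repo> -f json -o report.json) down to medium and high severity. It assumes Bandit's usual result keys (filename, line_number, test_id, issue_severity); the function name reference_findings and the sample report are hypothetical and are not taken from the paper.

```python
KEPT_SEVERITIES = {"MEDIUM", "HIGH"}

def reference_findings(report: dict) -> list[dict]:
    """Keep only medium/high-severity results from a parsed Bandit JSON report."""
    return [
        {
            "file": r["filename"],
            "line": r["line_number"],
            "rule": r["test_id"],
            "severity": r["issue_severity"],
        }
        for r in report.get("results", [])
        if r.get("issue_severity", "").upper() in KEPT_SEVERITIES
    ]

# Hypothetical two-result report, shaped like Bandit's JSON output.
sample_report = {
    "results": [
        {"filename": "app/db.py", "line_number": 42, "test_id": "B608",
         "issue_severity": "MEDIUM", "issue_text": "Possible SQL injection."},
        {"filename": "app/util.py", "line_number": 7, "test_id": "B101",
         "issue_severity": "LOW", "issue_text": "Use of assert detected."},
    ]
}

# Only the MEDIUM finding survives the filter; the LOW one is dropped.
print(reference_findings(sample_report))
```

Dropping low-severity results mirrors the study's choice to count only medium/high findings as ground truth; whether that threshold suits another codebase is a judgment call.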

Performance was scored as recall (the fraction of Bandit’s medium/high findings the model also caught, with partial credit) minus a false-positive rate, giving a composite score between -1 and 1. Every composite came out negative:

Model	Beaverhabits (R / FP / Comp)	Fail2ban (R / FP / Comp)	Yum (R / FP / Comp)
gemma3	0.25 / 0.88 / -0.62	0.00 / 0.80 / -0.80	0.00 / 0.89 / -0.89
llama3.1	0.25 / 0.46 / -0.21	0.03 / 0.40 / -0.37	0.00 / 0.59 / -0.59
qwen2.5	0.25 / 0.28 / -0.03	0.01 / 0.14 / -0.12	0.00 / 0.40 / -0.40
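
The composite is simple enough to check by hand. The sketch below recomputes it for the llama3.1 row, assuming only the definition given above (recall minus false-positive rate); the function name composite is illustrative, and the paper's partial-credit rule for recall is not reproduced here.

```python
def composite(recall: float, fp_rate: float) -> float:
    """Composite score as described above: recall minus false-positive rate."""
    for name, value in (("recall", recall), ("fp_rate", fp_rate)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return recall - fp_rate

# llama3.1 row from the table: (recall, false-positive rate) per repository.
llama_rows = {
    "Beaverhabits": (0.25, 0.46),
    "Fail2ban": (0.03, 0.40),
    "Yum": (0.00, 0.59),
}

for repo, (recall, fp_rate) in llama_rows.items():
    print(repo, round(composite(recall, fp_rate), 2))
```

Two published cells (gemma3 on Beaverhabits, qwen2.5 on Fail2ban) differ by 0.01 from what the rounded inputs give, presumably because the composites were computed from unrounded rates.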

The dominant failure mode was hallucination. The where field — the file and line range of each issue — was frequently fabricated, pointing at non-existent files or locations. Only the separately-captured filename let the researchers analyze the output at all. The models also misread normal engineering patterns as vulnerabilities: a .env / dotenv setup flagged as hardcoded production secrets, standard-library constants like calendar.MONDAY flagged as secrets, license headers and Python dunder methods flagged as injection. qwen2.5 was the least bad (fewest false positives); gemma3 the worst. The authors note that closed or specialist models with better prompting might score higher — but the magnitude of the gap suggests the problem is structural, not a prompt-tuning artifact.
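
A fabricated where field is cheap to detect mechanically. The following is a minimal sketch, not the authors' tooling: it assumes the model's location has already been parsed into a relative path plus a start and end line, and the helper name location_exists is hypothetical. It needs Python 3.9 or later for Path.is_relative_to.

```python
from pathlib import Path

def location_exists(repo_root: str, rel_path: str,
                    start_line: int, end_line: int) -> bool:
    """Return True only if the file exists inside the repo and contains the line range."""
    root = Path(repo_root).resolve()
    target = (root / rel_path).resolve()
    # Reject paths that escape the repository or point at nothing.
    if not target.is_relative_to(root) or not target.is_file():
        return False
    # Reject impossible ranges before reading the file.
    if start_line < 1 or end_line < start_line:
        return False
    with target.open("r", encoding="utf-8", errors="replace") as fh:
        line_count = sum(1 for _ in fh)
    return end_line <= line_count
```

A check like this only proves the location is real, not that the finding is correct; it is a filter to run before any human looks at the output.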

Why it matters

This is the defensive mirror image of recent agentic-vulnerability work like Code-Augur and long-horizon bug-hunting benchmarks: when you strip away the scaffolding and just point an off-the-shelf model at a codebase, it underperforms a 1990s-style pattern matcher. The danger isn’t that the agent finds nothing — it’s that it produces fluent, confident output that is mostly noise, encouraging developers to “fix” non-issues while real findings get buried. That false confidence is its own risk surface, and it compounds the trust problems already documented around coding-LLM usability and AI-introduced code defects. Speed makes it worse: Bandit cleared each repo in under a minute, while Snitch took one to four hours per repo (~672 LLM calls, 10-12 hours total).

Defenses

For teams weighing an LLM agent as a security scanner, the practical takeaways:

Keep the deterministic tool as the source of truth. Rule-based SAST (Bandit, Semgrep, CodeQL) is fast, reproducible, explainable, and CWE-linked. Treat the LLM as an optional second pass, never the gate.
Never trust an LLM’s line numbers. The where field hallucinated routinely; any agent finding must be re-anchored to the actual source before a human acts on it.
Expect a flood of false positives. Tune for the .env/dotenv, constants-as-secrets, and comment/license patterns the study saw — or you will train developers to ignore the tool entirely.
Score before you adopt. Reproduce the paper’s cheap metric — recall and false-positive rate against a vetted baseline on your code — rather than trusting a vendor demo.
Use LLMs where they’re actually strong. Notably, the researchers used Claude to generate the cross-referencing script that parsed the noisy output — LLM-as-assistant to a deterministic pipeline, not LLM-as-scanner.
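
The "score before you adopt" step can be approximated with a small cross-referencing script. The sketch below is an assumption-laden stand-in for the paper's method: it matches a model finding to a reference finding when the filenames are equal and the reference line falls within the reported range plus a tolerance, uses binary matches rather than the paper's partial credit, and treats every unmatched model finding as a false positive. The function name cross_reference, the dictionary keys, and the default tolerance of 3 lines are all hypothetical.

```python
def cross_reference(model_findings: list[dict], reference: list[dict],
                    tolerance: int = 3):
    """Split model findings into matched/unmatched and derive recall and FP rate."""
    matched, unmatched = [], []
    hit_reference = set()  # indices of reference findings the model located
    for finding in model_findings:
        hits = [
            i for i, ref in enumerate(reference)
            if ref["file"] == finding["file"]
            and finding["start"] - tolerance <= ref["line"] <= finding["end"] + tolerance
        ]
        if hits:
            matched.append(finding)
            hit_reference.update(hits)
        else:
            unmatched.append(finding)
    # Assumption: recall is 0.0 when the reference set is empty.
    recall = len(hit_reference) / len(reference) if reference else 0.0
    fp_rate = len(unmatched) / len(model_findings) if model_findings else 0.0
    return matched, unmatched, recall, fp_rate

# Hypothetical data: two reference findings, two model findings.
reference = [{"file": "a.py", "line": 10}, {"file": "b.py", "line": 5}]
model = [
    {"file": "a.py", "start": 9, "end": 12},   # overlaps the a.py reference
    {"file": "c.py", "start": 1, "end": 2},    # no reference in c.py
]
matched, unmatched, recall, fp_rate = cross_reference(model, reference)
print(recall, fp_rate, recall - fp_rate)
```

Matching on the separately captured filename rather than on the model's own where text follows the study's observation that only the filename was reliable.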

Status

Item	Reference	Date	Notes
Empirical assessment	arXiv:2606.11672	2026-06-10	Snitch agent vs Bandit, 224 files / 101,816 LOC
Models tested	gemma3, llama3.1, qwen2.5 (via Ollama)	2026-06-10	All general-purpose open-weight models
Best composite	qwen2.5	2026-06-10	Still negative on all three repos
Baseline tool	Bandit (PyCQA)	—	Rule-based Python SAST, CWE-linked

The takeaway: in mid-2026, a general-purpose open-source LLM agent is a poor substitute for a deterministic SAST tool — high false positives, fabricated locations, and hours of runtime. The useful pattern is the inverse of the hype: let proven tooling do the detection, and let the model assist with triage and explanation under human review.