The serving layer is the attack surface: concurrency bugs in vLLM and SGLang
A May 2026 fuzzer, GRIEF, treats concurrent request traces as inputs and finds 15 serving-layer bugs (2 CVEs) in vLLM and SGLang: cross-request output contamination, noisy-neighbor DoS, and delayed crashes — no malformed input required.
In brief
Most LLM security testing aims at the model: jailbreaks, prompt injection, hallucination. But a growing body of work argues that the serving framework (the engine that batches requests, shares a KV cache across tenants, and schedules GPU time) is its own security boundary, and a leakier one. A paper published in May 2026, Continuous Discovery of Vulnerabilities in LLM Serving Systems with Fuzzing (arXiv 2605.11202), introduces GRIEF, a greybox fuzzer that mutates timed concurrent request traces. In early 8-hour campaigns against vLLM and SGLang it surfaced 15 vulnerabilities, 10 confirmed by maintainers, including 2 assigned CVEs, all triggered by perfectly valid API requests that only become dangerous when they overlap.
What is this?
A modern inference engine like vLLM or SGLang is not a thin wrapper around a model. To hit GPU utilisation targets it runs continuous batching, a shared paged KV cache, prefix sharing, speculative decoding, LoRA adapter multiplexing, and a multi-tenant scheduler. All of that creates shared mutable state across requests that belong to different users. Standard tests never exercise it: model-safety evals check whether a single answer is safe, API tests check whether a single request is handled correctly, and conventional fuzzers look for crashes from malformed bytes. None of them probe what happens when dozens of individually valid requests compete for the same cache and scheduler under realistic concurrency.
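To make "shared mutable state" concrete, here is a minimal sketch of a prefix cache keyed by chained block hashes. It is a toy model, not vLLM's or SGLang's implementation: `ToyPrefixCache`, `block_hashes`, `BLOCK_SIZE`, and the `namespace` parameter are all hypothetical names chosen for illustration.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # tokens per KV block (toy value, an assumption)


def block_hashes(tokens: List[int], namespace: str = "") -> List[str]:
    """Chain-hash every full block of tokens; the namespace seeds the chain."""
    hashes: List[str] = []
    parent = namespace
    full = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore the partial tail block
    for start in range(0, full, BLOCK_SIZE):
        block = tokens[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(f"{parent}|{block}".encode()).hexdigest()
        hashes.append(digest)
        parent = digest  # each block's hash depends on everything before it
    return hashes


class ToyPrefixCache:
    """Global block table shared by every request that reaches the engine."""

    def __init__(self) -> None:
        self.blocks: Dict[str, str] = {}  # block hash -> owner that first filled it

    def lookup_or_fill(self, tokens: List[int], owner: str,
                       namespace: str = "") -> Tuple[int, int]:
        """Return (reused, filled) block counts for this request."""
        reused = filled = 0
        for h in block_hashes(tokens, namespace):
            if h in self.blocks:
                reused += 1  # this request now reads state another request wrote
            else:
                self.blocks[h] = owner
                filled += 1
        return reused, filled


if __name__ == "__main__":
    cache = ToyPrefixCache()
    print(cache.lookup_or_fill(list(range(10)), "alice"))
    print(cache.lookup_or_fill(list(range(8)) + [99], "bob"))
```

With an empty `namespace`, the second request reuses blocks the first one filled, which is the efficiency win and also the shared state that per-request tests never exercise. Passing a distinct `namespace` per caller gives each one its own hash chain, so nothing is shared.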
GRIEF (arXiv 2605.11202) is presented as the first systematic attempt to fuzz that surface. Its threat model is deliberately weak: an unprivileged API client sending well-formed requests through documented endpoints, with no malformed traffic, no privileged access, and no host co-location. The authors describe the result as the first empirical evidence that the serving stack is “a distinct and under-tested vulnerability surface.”
How it works
The key idea is the input abstraction. Instead of mutating a byte string or a single prompt, GRIEF treats a timed request trace as the unit of fuzzing: a timestamped sequence of client events that captures when requests overlap, what shape they carry, and which requests should share a semantic prompt identity. The fuzzer mutates co-batching, prefix-cache reuse, adapter co-scheduling, cancellation timing, and scheduler pressure, then uses lightweight oracles — latency anomalies, KV-cache telemetry, and cross-run output comparison — to flag suspicious runs. Because LLM serving is noisy and non-deterministic, anything suspicious is sent to a controlled-replay stage with majority vote and token log-probability checks, separating real bugs from scheduler jitter.
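The input abstraction can be sketched as a small data structure plus one mutation. This is an illustrative guess at the shape of a timed request trace, not GRIEF's actual format: the `Event` fields, `normalise`, and `mutate_overlap` are hypothetical names, and the single mutation shown (pulling one event close to another in time) stands in for the richer set of mutations the paper describes.

```python
import random
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Event:
    t_ms: int             # send time relative to the start of the trace
    client: str           # which simulated client issues the event
    kind: str             # "send" or "cancel"
    request_id: str
    prompt_id: str = ""   # requests with the same prompt_id share a prompt identity
    max_tokens: int = 0   # part of the request "shape"


Trace = List[Event]


def normalise(trace: Trace) -> Trace:
    """Sort by time and keep every cancel strictly after its matching send."""
    send_time = {e.request_id: e.t_ms for e in trace if e.kind == "send"}
    fixed = [
        replace(e, t_ms=max(e.t_ms, send_time.get(e.request_id, 0) + 1))
        if e.kind == "cancel" else e
        for e in trace
    ]
    # At equal timestamps, sends sort before cancels.
    return sorted(fixed, key=lambda e: (e.t_ms, e.kind != "send"))


def mutate_overlap(trace: Trace, rng: random.Random, window_ms: int = 50) -> Trace:
    """Move one random event to within window_ms of another to encourage overlap."""
    if len(trace) < 2:
        return normalise(trace)
    i, j = rng.sample(range(len(trace)), 2)
    mutated = list(trace)
    mutated[i] = replace(trace[i], t_ms=trace[j].t_ms + rng.randint(0, window_ms))
    return normalise(mutated)


if __name__ == "__main__":
    seed_trace = [
        Event(0, "a", "send", "r1", prompt_id="p1", max_tokens=16),
        Event(400, "b", "send", "r2", prompt_id="p1", max_tokens=16),
        Event(30, "a", "cancel", "r1"),
    ]
    for event in mutate_overlap(seed_trace, random.Random(0)):
        print(event)
```

Every event in a mutated trace is still a well-formed request; only the timing changes, which matches the weak threat model described earlier.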
The findings cluster into three impact classes (Table 1 of the paper):
| Failure class | Total | vLLM | SGLang | Confirmed | Representative impact |
| --- | --- | --- | --- | --- | --- |
| State corruption / isolation | 7 | 5 | 2 | 7 | Cross-request output contamination |
| Performance pathology | 4 | 2 | 2 | 2 | Noisy-neighbor denial of service |
| Crash / liveness | 2 | 1 | 1 | 1 | Scheduler-process availability loss |
In the isolation class, a concurrent workload caused one request to reuse stale KV-cache state from another request, polluting the victim’s output — an isolation break with no crash and no error in the logs. In the performance class, a valid but adversarially shaped request externalised its cost onto co-scheduled users, driving first-token latency up and useful throughput to near zero. In the liveness class, a batch of valid LoRA-adapter requests violated a scheduler invariant and aborted the serving process — and because the crash was delayed relative to the trigger, the logs showed normal prefill and successful responses right up to the failure.
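Because an isolation break produces no crash and no log line, the only evidence is the output itself. A minimal sketch of a cross-run comparison with majority vote follows. It assumes deterministic decoding (for example temperature 0) so that an isolated run is a meaningful baseline, and it omits the token log-probability check the paper also uses. `confirm_by_replay`, `run_isolated`, and `run_concurrent` are hypothetical names; the two callables stand for whatever harness sends the victim request alone or inside the suspicious trace.

```python
from collections import Counter
from typing import Callable


def confirm_by_replay(run_isolated: Callable[[], str],
                      run_concurrent: Callable[[], str],
                      replays: int = 5) -> bool:
    """Return True if the concurrent output differs from the isolated
    baseline in a strict majority of replays."""
    if replays < 1:
        raise ValueError("replays must be at least 1")
    # The baseline is the most common output when the request runs alone.
    baseline_votes = Counter(run_isolated() for _ in range(replays))
    baseline, _ = baseline_votes.most_common(1)[0]
    # Count how often the same request, replayed under concurrency, disagrees.
    mismatches = sum(1 for _ in range(replays) if run_concurrent() != baseline)
    return mismatches * 2 > replays


if __name__ == "__main__":
    outputs = iter(["ok", "leaked", "leaked", "leaked", "ok"])
    print(confirm_by_replay(lambda: "ok", lambda: next(outputs)))
```

Requiring a majority rather than a single mismatch is what keeps ordinary scheduler jitter from being reported as a bug.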
That performance-pathology class is exactly what a separate February 2026 paper, Rethinking Latency Denial-of-Service: Attacking the LLM Serving Framework, Not the Model (arXiv 2602.07878), formalises as a “Fill and Squeeze” strategy: Fill exhausts the global KV cache to induce head-of-line blocking, and Squeeze forces the scheduler into repeated preemption. The authors report time-to-first-token slowdowns ranging from 20× to 280× at 30–40% lower attack cost than prior model-level slowdown attacks, achieved by targeting the FCFS scheduler state rather than the model.
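The head-of-line blocking behind the Fill step can be seen in a toy queueing model. The sketch below is an abstract discrete-time simulation of strict FCFS admission against a fixed pool of KV blocks. It does not model any real engine, it ignores preemption (the Squeeze step), and `Req` and `fcfs_start_times` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Req:
    name: str
    arrival: int  # tick at which the request enters the queue
    blocks: int   # KV blocks held while the request runs
    ticks: int    # how many ticks it runs once admitted


def fcfs_start_times(reqs: List[Req], total_blocks: int) -> Dict[str, int]:
    """Strict FCFS: the queue head must fit before anything behind it may start."""
    if any(r.blocks > total_blocks for r in reqs):
        raise ValueError("a request larger than the whole cache can never run")
    queue = sorted(reqs, key=lambda r: r.arrival)  # stable, so ties keep input order
    running: List[Tuple[int, int]] = []            # (finish_tick, blocks_held)
    start: Dict[str, int] = {}
    tick = 0
    while queue:
        running = [(f, b) for f, b in running if f > tick]  # free finished requests
        free = total_blocks - sum(b for _, b in running)
        head = queue[0]
        if head.arrival <= tick and head.blocks <= free:
            queue.pop(0)
            running.append((tick + head.ticks, head.blocks))
            start[head.name] = tick
            continue  # try to admit the next request in the same tick
        tick += 1
    return start


if __name__ == "__main__":
    small = Req("small", arrival=1, blocks=2, ticks=3)
    print(fcfs_start_times([small], total_blocks=10))
    print(fcfs_start_times([Req("big", 0, 9, 20), small], total_blocks=10))
```

In this toy, a small request that would start the moment it arrives instead waits until the large request ahead of it releases its blocks. The start-time gap is a rough stand-in for the time-to-first-token inflation the paper measures on real systems.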
Why it matters
Two properties make this worse than it sounds. First, none of these failures need a malformed input or an explicit error: they are silent. A tenant whose output was contaminated by another user’s KV-cache state has no signal that anything went wrong; an operator watching crash-only dashboards sees a healthy server. Empirical work cited by the GRIEF authors finds that over 35% of inference-engine bugs manifest as non-crash anomalies. Second, the attacker bar is low: a normal paying API user on a shared multi-tenant endpoint can, by timing requests, degrade or contaminate other tenants. For anyone running a shared vLLM or SGLang fleet, whether an internal platform team or an inference-as-a-service provider, this is a cross-tenant confidentiality and availability problem, not a model-quality footnote. The availability side maps onto the Unbounded Consumption entry in the OWASP Top 10 for LLM Applications, and the confidentiality side onto classic tenant-isolation requirements.
Defenses
No payloads are needed to act on this; the mitigations are architectural.
- Test concurrency, not just requests. Add serving-layer fuzzing or replay of concurrent traces (GRIEF’s approach) to CI for any self-hosted engine. Per-request correctness tests will not catch these.
- Enforce tenant isolation in the cache. Partition or key the KV cache and prefix-sharing pools by tenant so cross-request reuse cannot cross a trust boundary; disable prefix sharing across tenants where confidentiality matters.
- Make the scheduler fair and bounded. Replace naive FCFS with fair-share or priority scheduling, cap per-tenant KV-cache occupancy and concurrent in-flight tokens, and apply admission control so one client cannot monopolise GPU memory. Recent scheduling work (e.g. head-of-line-blocking mitigations) directly targets the “Fill and Squeeze” pattern.
- Use quotas and rate limits as a financial floor. Per-tenant token-rate and request-rate limits are the first defence against noisy-neighbor and denial-of-wallet effects.
- Observe the right signals. Crash-only monitoring is blind here. Track per-tenant TTFT/TPOT distributions, KV-cache hit/eviction telemetry, and preemption counts; treat sustained cross-tenant latency divergence as a security event.
- Patch promptly. The two GRIEF CVEs were confirmed by vLLM/SGLang maintainers — keep engines current and watch their advisories.
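As one illustration of the "fair and bounded" item above, per-tenant caps on KV-cache occupancy can be expressed as a small admission check in front of the scheduler. This is a sketch under assumptions, not a feature of vLLM or SGLang: `TenantKVBudget`, `try_admit`, and `release` are hypothetical names, and a real deployment would need to hook the check into the engine's own block accounting.

```python
from collections import defaultdict
from typing import Dict


class TenantKVBudget:
    """Admission control that bounds KV-block use per tenant and in total."""

    def __init__(self, total_blocks: int, per_tenant_cap: int) -> None:
        self.total_blocks = total_blocks
        self.per_tenant_cap = per_tenant_cap
        self.in_use: Dict[str, int] = defaultdict(int)

    def try_admit(self, tenant: str, blocks: int) -> bool:
        """Reserve blocks for a request, or refuse if a cap would be exceeded."""
        if blocks <= 0:
            raise ValueError("blocks must be positive")
        if self.in_use[tenant] + blocks > self.per_tenant_cap:
            return False  # this tenant is already at its share
        if sum(self.in_use.values()) + blocks > self.total_blocks:
            return False  # the cache as a whole is full
        self.in_use[tenant] += blocks
        return True

    def release(self, tenant: str, blocks: int) -> None:
        """Return blocks when a request finishes or is cancelled."""
        self.in_use[tenant] = max(0, self.in_use[tenant] - blocks)


if __name__ == "__main__":
    budget = TenantKVBudget(total_blocks=10, per_tenant_cap=4)
    print(budget.try_admit("tenant-a", 4))
    print(budget.try_admit("tenant-a", 1))
    print(budget.try_admit("tenant-b", 4))
```

The design choice is that a refusal is local to the tenant that exceeded its share, so one client filling its quota queues or rejects its own requests rather than everyone else's.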
Status
GRIEF was published May 2026 (arXiv 2605.11202); it reports 15 findings, 10 developer-confirmed and 2 with assigned CVEs, with further CVE requests pending, across vLLM and SGLang tested on Qwen-2.5-0.5B-Instruct. The complementary latency-DoS analysis (arXiv 2602.07878) dates to February 2026. The authors of GRIEF followed responsible disclosure and deliberately withhold a full exploit inventory; this article likewise describes the classes of failure and their defences, not reproduction steps. The broader takeaway is durable regardless of any single patch: for self-hosted LLMs, the inference engine’s concurrency, caching, and scheduling logic is a first-class security boundary that needs its own testing.