Prompt theft by timing: prefix-cache side channels in multi-tenant LLMs
Shared prefix caching makes LLM APIs faster, and it leaks prompts. By timing the first token, an attacker can rebuild another tenant's prompt. A March 2026 paper defends against the attack without killing performance.
What is this?
Modern LLM serving systems cache the work they have already done. Automatic Prefix Caching (APC) reuses the key–value tensors computed for the beginning of a request whenever a later request starts with the same text — a large speed-up for long documents and multi-turn chats. The problem is that a cache hit is faster than a cache miss, and that timing difference is observable from the outside. In a multi-tenant deployment where the prefix cache is shared across users, an attacker can send crafted prompts, measure how long the first token takes, and learn whether their guessed prefix matches something another tenant already cached. Repeat the probe, token by token, and you can reconstruct a stranger’s prompt.
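To make the reuse mechanism concrete, here is a minimal Python sketch of a block-level prefix cache. It is a toy under stated assumptions, not the implementation of any serving system: the `ToyPrefixCache` class, the `block_hashes` helper and the 4-token block size are all hypothetical, and a set of hashes stands in for the stored key-value tensors. Chaining each block's hash to its parent is one way to make a block match only when everything before it also matches.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block (illustrative assumption)


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Return one chained hash per full block of tokens.

    Each hash covers the block's tokens plus the previous block's hash,
    so a block can only match if the entire prefix before it matches too.
    """
    hashes = []
    parent = b""
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


class ToyPrefixCache:
    """Hypothetical shared cache: a set of block hashes stands in for KV tensors."""

    def __init__(self):
        self._blocks = set()

    def lookup_and_insert(self, tokens):
        """Return how many leading tokens were already cached, then cache the request."""
        hashes = block_hashes(tokens)
        hit_blocks = 0
        for h in hashes:
            if h not in self._blocks:
                break
            hit_blocks += 1
        self._blocks.update(hashes)
        return hit_blocks * BLOCK_SIZE


if __name__ == "__main__":
    cache = ToyPrefixCache()
    cache.lookup_and_insert(list(range(12)))          # victim request fills the cache
    probe = list(range(8)) + [99, 98, 97, 96]         # shares the first 8 tokens only
    print(cache.lookup_and_insert(probe))             # tokens the probe did not have to recompute
```

The number returned for the probe is the amount of prefill work it skipped. In a real system that skipped work is what shows up as a shorter time-to-first-token.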
This is not theoretical. A Stanford team (Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang and Tatsunori Hashimoto) audited real APIs in "Auditing Prompt Caching in Language Model APIs" (arXiv 2502.07776, February 2025, accepted at ICML 2025) and detected global cache sharing across users in seven API providers, including OpenAI. The same timing signal even leaked an architectural fact: that OpenAI’s embedding model is a decoder-only Transformer. The newest entry in this line of work, CacheSolidarity (arXiv 2603.10726, March 2026), is what prompted this write-up.
How it works
There is no payload to redact here — the attack is pure measurement against a black-box API. The mechanism:
| Stage | What happens |
|---|---|
| 1. Pick a target prefix | Guess a candidate string that a victim's prompt might begin with (e.g. a system-prompt header, an email address, an account ID) |
| 2. Probe | Send a request starting with that candidate and record the time-to-first-token (TTFT) |
| 3. Classify hit vs miss | Low TTFT => the prefix was already cached by someone else (hit); high TTFT => miss |
| 4. Extend & repeat | Append the next token, re-probe, and walk the prefix forward one piece at a time |
The attacker never reads another user’s data directly; they infer it from latency. Follow-on work has shown how far this goes: PromptPeek (Wu et al., 2025), as reported in the SafeKV study, reconstructs prompts with up to 99% accuracy when the prompt template is known and ~95% without it, targeting the radix-tree prefix scheduling used by serving frameworks such as SGLang. Both CacheSolidarity and SafeKV list the same set of systems that ship prefix caching: OpenAI, DeepSeek, Google Gemini, MoonShot Kimi, vLLM and SGLang.
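Steps 2 and 3 come down to a small statistical test, sketched below in Python. The snippet is a simulation only and contacts no API: `simulated_ttft_ms` is a made-up latency model (about 40 ms for a hit, 120 ms for a miss, with Gaussian noise) standing in for real measurements, and `classify_probe` is a hypothetical helper, not a method taken from any of the cited papers. It also ignores a real complication, namely that sending a probe can itself populate the cache.

```python
import random
import statistics


def simulated_ttft_ms(cached, rng):
    """Toy latency model (assumed numbers): a cached prefix skips prefill work."""
    base = 40.0 if cached else 120.0
    return base + rng.gauss(0.0, 15.0)


def classify_probe(probe_samples, miss_baseline, hit_baseline):
    """Label a probe 'hit' or 'miss' by comparing medians.

    The threshold is the midpoint between the median of known-miss timings
    and the median of known-hit timings. Medians are used because latency
    samples tend to have heavy tails.
    """
    threshold = (statistics.median(miss_baseline) + statistics.median(hit_baseline)) / 2
    return "hit" if statistics.median(probe_samples) < threshold else "miss"


if __name__ == "__main__":
    rng = random.Random(0)
    # Calibration: timings for prompts known to be uncached and known to be cached.
    miss_baseline = [simulated_ttft_ms(False, rng) for _ in range(20)]
    hit_baseline = [simulated_ttft_ms(True, rng) for _ in range(20)]
    # Probe timings for a candidate prefix that (in this simulation) is cached.
    probe = [simulated_ttft_ms(True, rng) for _ in range(20)]
    print(classify_probe(probe, miss_baseline, hit_baseline))
```

The design point is that a single measurement is rarely enough. The classification works on distributions, which is why noise from hardware and load determines whether the channel is usable at all.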
Why it matters
The exposure scales with how much sensitive text ends up in a shared prefix. The SafeKV authors note that common pre-training corpora such as C4 and the Pile carry usernames, phone numbers, payment-card numbers and Social Security numbers — exactly the kind of token sequence that becomes inferable if it is cached and shared across tenants. Anything you place at the front of a prompt — a system prompt with embedded secrets, a customer record prepended for context, a retrieved document — is a candidate for reconstruction by a co-tenant.
The honest framing: this is a confidentiality leak under specific conditions, not remote code execution. The attacker needs a multi-tenant service that shares cache across security boundaries, a measurable timing gap, and enough probing budget. CacheSolidarity’s own analysis stresses that exploitability depends on hardware, model size, system load and request length — on some configurations the hit/miss gap is too noisy to weaponise. That is a real mitigating factor, but it is a property of the deployment, not a guarantee.
Defenses
Research has converged on three families of defense, with a clear progression in how surgically they cut.
- Full user-level isolation. Disable cross-tenant prefix sharing entirely; each user (or org) gets a private cache. This closes the channel but throws away APC’s benefit. SafeKV measured isolation inflating time-to-first-token by up to ~38% on large models; CacheSolidarity reports similar costs. Effective, blunt, expensive.
- Timing obfuscation. Pad or jitter responses so hits and misses look alike. The catch, spelled out by both the SafeKV and CacheSolidarity papers: the attacker only needs a statistical separation, so you must pad hit responses toward the slow tail of the miss distribution, which erases most of the latency win and inflates P95/P99.
- Selective isolation. Share by default, isolate only what is risky. SafeKV (arXiv 2508.08438) classifies sensitive prefixes asynchronously and keeps them private, recovering throughput up to 2.66× versus full isolation. CacheSolidarity goes further: it tracks prefix ownership, flags prefixes that multiple users suspiciously reuse, and isolates only those — plus an “Activator” that turns isolation on only when the timing gap is actually distinguishable for the current hardware and load. Built on vLLM and evaluated across nine models (0.5B–13B), it reports up to 70% higher cache reuse and 30% lower latency than user-level isolation, with negligible overhead, and the authors say it will be open-sourced.
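The difference between full and selective isolation can be pictured as a choice of cache scope per prefix. The Python sketch below is a deliberately simplified illustration, not SafeKV's or CacheSolidarity's actual algorithm: `SelectiveCache` and `looks_sensitive` are hypothetical names, the SSN-like regex is a stand-in for a real sensitivity classifier, and classification here is synchronous where SafeKV's is described as asynchronous. It does not model CacheSolidarity's reuse flagging or its Activator.

```python
import re

# Hypothetical detector: flags text containing an SSN-like pattern.
_SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def looks_sensitive(prefix: str) -> bool:
    """Stand-in for a sensitivity classifier (assumption, not a real one)."""
    return _SSN_LIKE.search(prefix) is not None


_SHARED = object()  # scope marker for entries any tenant may reuse


class SelectiveCache:
    """Share prefixes by default; keep sensitive ones private to each user."""

    def __init__(self, is_sensitive=looks_sensitive):
        self._is_sensitive = is_sensitive
        self._keys = set()

    def lookup_and_insert(self, user: str, prefix: str) -> bool:
        """Return True on a cache hit, then record the prefix under its scope."""
        # Sensitive prefixes are keyed per user, so another tenant's probe
        # can never hit them; everything else lives in the shared scope.
        scope = user if self._is_sensitive(prefix) else _SHARED
        key = (scope, prefix)
        hit = key in self._keys
        self._keys.add(key)
        return hit


if __name__ == "__main__":
    cache = SelectiveCache()
    cache.lookup_and_insert("alice", "You are a helpful assistant.")
    print(cache.lookup_and_insert("bob", "You are a helpful assistant."))   # shared scope
    cache.lookup_and_insert("alice", "SSN 123-45-6789 on file")
    print(cache.lookup_and_insert("mallory", "SSN 123-45-6789 on file"))    # private scope
```

Full user-level isolation corresponds to always using the per-user scope. The selective approaches keep the shared scope for the bulk of traffic and pay the isolation cost only where a hit would reveal something.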
For teams running or buying LLM inference, the practical checklist:
- Don’t put secrets in shared-prefix position. Keep API keys, credentials and per-user PII out of system prompts and out of the cacheable head of a request.
- Scope the cache to a trust boundary. Per-organization or per-tenant cache scoping removes the cross-tenant channel; confirm your provider does this rather than sharing globally.
- Treat cache policy as a procurement question. Ask vendors whether prefix/semantic caching is shared across customers and how long entries persist — the providers in the audit had retention windows from minutes to days.
- If you self-host, prefer selective isolation. Full isolation works but is costly; SafeKV- and CacheSolidarity-style selective approaches keep the speed-up while closing the channel.
Status
| Item | Reference | Date | Notes |
|---|---|---|---|
| Empirical API audit | arXiv 2502.07776 (Stanford, ICML 2025) | 2025-02 | Global cross-user cache sharing in 7 providers incl. OpenAI |
| PromptPeek (prompt-stealing attack) | NDSS 2025 (Wu et al.) | 2025 | Up to 99% (template known) / ~95% (template unknown) prompt reconstruction via shared KV cache |
| SafeKV (selective isolation) | arXiv 2508.08438 | 2025-08 | Async sensitivity classification; up to 2.66× throughput vs isolation |
| CacheSolidarity (selective isolation) | arXiv 2603.10726 | 2026-03 | Flags suspicious reuse; up to 70% more reuse, 30% lower latency than isolation |
The throughline of two years of this research is that the speed optimisations LLM serving depends on — prefix and KV-cache sharing — are also a confidentiality side channel when the cache crosses a trust boundary. The 2026 work doesn’t ask you to choose between speed and safety; it shows the channel can be closed by isolating the few prefixes that matter rather than walling off every user.