
Catching credential exfiltration in LLM agents before the output token

An arXiv paper posted June 2, 2026 proposes detecting agent credential leaks before any output token is emitted, combining activation probes, calibrated honeytokens, and multi-turn leakage accounting.

2026-06-04 // 7 min read // affects: qwen2.5, llama-3.1-8b, mistral-7b

What is this?

“Caught in the Act(ivation)” (arXiv:2606.04141, Kargi Chauhan and Pratibha Revankar, posted June 2, 2026, CC BY 4.0) studies a narrow but common agent failure: LLM agents routinely place sensitive credentials (API keys, database passwords, OAuth tokens, SSH identities) in the same context window as untrusted retrieved content. That co-location is exactly what makes indirect prompt injection (Greshake et al., 2023) dangerous: an instruction hidden in a web page, email, or tool result can steer the agent into revealing the very secrets it needs to act. Credentials are the private-data leg of the “lethal trifecta” (access to private data, exposure to untrusted content, and a way to communicate externally), and a real campaign, the Marimo post-exploitation case, showed an agent draining cloud credentials and a database in under two minutes.

The paper’s argument is that most deployed defenses look in the wrong place. Input filters and output monitors operate at the text level, so they are evaded by Base64/hex encodings, Unicode substitution, paraphrase, or slow leakage spread across turns. The authors propose moving detection earlier (into the model’s internals) and later (across the whole conversation), and bundle three monitors into a research prototype they call the Agentic Immune System (AIS) — explicitly “aspirational,” not deployment-ready.

How it works

AIS combines three deliberately different signals. The threat model is gray-box / Kerckhoffs-style: the attacker may know the defenses exist but cannot read calibrated thresholds, probe weights, or which visible tokens are honey.

  • CIFT — activation-based access detection. Forward hooks capture hidden states from the last K = ⌊0.25 · L_total⌋ transformer layers. A per-layer Mahalanobis deviation is combined through learned non-negative weights into a Causal Flow Score, classified before any output token is emitted. Because it reads internal features, encodings that fool text detectors do not change the signal.
  • DP-Honey — calibrated honeytokens. Format-matched canary credentials are generated from character models and wrapped in split conformal prediction, which sets the detection threshold from a held-out benign set instead of hand tuning. Real credentials live in a labeled map consumed by the tool runtime; only honeytokens are inserted into the model-visible context.
  • Nimbus — cumulative leakage scoring. Multi-turn exfiltration is treated as an information-flow budget, scored in bits (InfoNCE-based). A single turn is capped near log₂(17) ≈ 4.09 bits, so Nimbus is not meant to catch a one-shot leak — it catches slow leakage that per-turn detectors never accumulate.
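
As a rough illustration of the CIFT scoring step in the first bullet, the sketch below selects the last K layers, computes a per-layer Mahalanobis deviation against benign statistics, and combines the deviations with non-negative weights. It is not the authors' implementation: it assumes a diagonal covariance for brevity, uses plain Python lists in place of real hidden states, and the names `last_k_layers`, `LayerStats`, and `causal_flow_score`, along with every weight and threshold, are hypothetical.

```python
import math
from dataclasses import dataclass


def last_k_layers(total_layers: int, fraction: float = 0.25) -> list[int]:
    """Indices of the last K = floor(fraction * total_layers) layers."""
    k = math.floor(fraction * total_layers)
    return list(range(total_layers - k, total_layers))


@dataclass
class LayerStats:
    """Benign-traffic statistics for one layer (diagonal covariance assumed)."""
    mean: list[float]
    var: list[float]


def mahalanobis(hidden: list[float], stats: LayerStats) -> float:
    """Mahalanobis distance of one hidden state under a diagonal covariance."""
    return math.sqrt(sum((h - m) ** 2 / v
                         for h, m, v in zip(hidden, stats.mean, stats.var)))


def causal_flow_score(hiddens, stats, weights) -> float:
    """Weighted sum of per-layer deviations; weights must be non-negative."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    return sum(w * mahalanobis(h, s) for w, h, s in zip(weights, hiddens, stats))


# Toy example: a 28-layer model monitors its last 7 layers.
layers = last_k_layers(28)
stats = [LayerStats(mean=[0.0, 0.0], var=[1.0, 4.0]) for _ in layers]
weights = [1.0 / len(layers)] * len(layers)

benign = [[0.1, -0.2] for _ in layers]
suspicious = [[3.0, 4.0] for _ in layers]
threshold = 2.0  # hypothetical; would be calibrated on held-out benign data

benign_score = causal_flow_score(benign, stats, weights)
suspicious_score = causal_flow_score(suspicious, stats, weights)
flagged = suspicious_score > threshold
```

In a real deployment the hidden states would come from forward hooks on an open-weight model and the score would be computed before generation starts; the toy vectors here only show the arithmetic.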

The per-turn pipeline is a logical OR over the three alarms. The pseudocode below is illustrative, not an exploit:

for each turn:
    context  <- inject DP-Honey honeytokens
    score    <- CIFT(activations)             # pre-output
    if score > thr: flag credential access
    if honeytoken used: flag canary hit       # DP-Honey alarm
    leak     <- Nimbus.update(estimated_bits) # cumulative
    if leak > budget B: intervene
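
The same loop can be written as a small runnable sketch. The detector outputs are passed in as plain numbers, because the real CIFT, DP-Honey, and Nimbus components are not specified in this summary. `TurnSignals`, `Monitor`, the thresholds, and the 12-bit budget are all hypothetical, and clipping each turn at log2(17) bits is one assumed reading of the per-turn cap.

```python
import math
from dataclasses import dataclass, field

PER_TURN_CAP_BITS = math.log2(17)  # about 4.09 bits, the per-turn ceiling


@dataclass
class TurnSignals:
    cift_score: float       # pre-output activation score (assumed given)
    honeytoken_hit: bool    # True if a canary credential was used (assumed given)
    estimated_bits: float   # per-turn leakage estimate (assumed given)


@dataclass
class Monitor:
    cift_threshold: float
    budget_bits: float
    leaked_bits: float = 0.0
    alarms: list = field(default_factory=list)

    def step(self, turn: int, s: TurnSignals) -> bool:
        """Return True if any of the three alarms fires on this turn."""
        # Accumulate leakage; each turn contributes at most the cap.
        self.leaked_bits += min(max(s.estimated_bits, 0.0), PER_TURN_CAP_BITS)
        fired = []
        if s.cift_score > self.cift_threshold:
            fired.append("cift")
        if s.honeytoken_hit:
            fired.append("honeytoken")
        if self.leaked_bits > self.budget_bits:
            fired.append("nimbus")
        if fired:
            self.alarms.append((turn, fired))
        return bool(fired)


# A slow leak: every turn looks benign alone, but the total crosses the budget.
monitor = Monitor(cift_threshold=2.0, budget_bits=12.0)
turns = [TurnSignals(cift_score=0.5, honeytoken_hit=False, estimated_bits=2.5)
         for _ in range(6)]
results = [monitor.step(i, s) for i, s in enumerate(turns)]
```

With these toy inputs no single turn trips the CIFT or honeytoken alarm, and only the cumulative total eventually exceeds the budget, which is the case the per-turn detectors miss.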

Why it matters

The numbers are promising, but the authors are careful to call them preliminary. On a held-out single-turn set (Qwen-7B), CIFT reports AUROC 0.998 / F1 0.990 / FPR 0.015 versus 0.85 AUROC for a LlamaGuard text baseline, and it stays stable under held-out encoding transforms that degrade text detectors. DP-Honey reaches 1.000 precision/recall with conformal coverage of 0.998 and no threshold tuning. On a synthetic suite of 50 conversations × 20 turns (4–8 attack turns each), Nimbus detects 0.90 of low-rate attacks where a per-turn LlamaGuard catches 0.12 and substring matching catches 0.00. The integrated prototype hits 0.94 detection at 0.018 FPR with about 16 ms of added latency.
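
The "no threshold tuning" claim rests on split conformal calibration. A minimal sketch of that idea follows, assuming one scalar suspicion score per held-out benign example and a target false-alarm level `alpha`: the threshold is the ceil((n + 1)(1 - alpha))-th smallest calibration score. This is the standard split conformal quantile, not the paper's code, and the scores and function names are hypothetical.

```python
import math


def conformal_threshold(benign_scores: list[float], alpha: float) -> float:
    """Split conformal threshold from held-out benign scores.

    Under exchangeability, a new benign score exceeds the returned value
    with probability at most alpha.
    """
    n = len(benign_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        # Too few calibration points to certify this alpha.
        return math.inf
    return sorted(benign_scores)[rank - 1]


# Hypothetical calibration scores from benign traffic (higher = more suspicious).
calibration = [0.02, 0.05, 0.07, 0.10, 0.11, 0.13, 0.18, 0.21, 0.30]
threshold = conformal_threshold(calibration, alpha=0.1)


def is_alarm(score: float) -> bool:
    """Flag a new score that exceeds the calibrated threshold."""
    return score > threshold
```

The guarantee only holds while new benign traffic resembles the calibration set, so the threshold would need recalibrating when the agent's workload shifts.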

The honesty matters more than the headline. The multi-turn benchmark is in-house and small; CIFT needs white-box activation access, so it does not apply to API-only models; cross-model transfer is untested; and Nimbus is a learned lower-bound signal, not a certified leakage bound. The authors name the worst gap themselves: credentials passed through structured tool-call arguments are a “severe structural blind spot,” because real agents often use secrets in serialized API calls rather than natural-language text. A multi-session attacker can also reset the budget by restarting conversations unless leakage state is stored across sessions. The very high AUROC in a controlled activation-probe setting is exactly the kind of result that needs independent replication.
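
The multi-session reset is avoided if the cumulative total is keyed to a stable principal rather than to a conversation. A minimal sketch using SQLite from the Python standard library follows; the table layout, the `principal` key, and the budget value are hypothetical and not taken from the paper.

```python
import sqlite3


class LeakageLedger:
    """Cumulative leakage in bits, keyed by principal instead of by session."""

    def __init__(self, conn: sqlite3.Connection, budget_bits: float):
        self.conn = conn
        self.budget_bits = budget_bits
        conn.execute(
            "CREATE TABLE IF NOT EXISTS leakage ("
            "principal TEXT PRIMARY KEY, bits REAL NOT NULL)"
        )

    def add(self, principal: str, bits: float) -> float:
        """Add this turn's estimate and return the running total."""
        self.conn.execute(
            "INSERT OR IGNORE INTO leakage (principal, bits) VALUES (?, 0.0)",
            (principal,),
        )
        self.conn.execute(
            "UPDATE leakage SET bits = bits + ? WHERE principal = ?",
            (bits, principal),
        )
        self.conn.commit()
        return self.total(principal)

    def total(self, principal: str) -> float:
        """Running total for a principal, or 0.0 if none is recorded."""
        row = self.conn.execute(
            "SELECT bits FROM leakage WHERE principal = ?", (principal,)
        ).fetchone()
        return row[0] if row else 0.0

    def over_budget(self, principal: str) -> bool:
        return self.total(principal) > self.budget_bits


# Use a file path instead of ":memory:" so totals survive process restarts.
conn = sqlite3.connect(":memory:")

session_1 = LeakageLedger(conn, budget_bits=12.0)
session_1.add("user-42", 4.0)
session_1.add("user-42", 4.0)

# A "new conversation" builds a fresh ledger object but sees the old total.
session_2 = LeakageLedger(conn, budget_bits=12.0)
total = session_2.add("user-42", 5.0)
```

Choosing the key is the hard part in practice: an attacker who can also rotate identities defeats a per-principal ledger, so the key should be whatever the deployment treats as the unit of trust.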

Defenses

The paper is itself a defense proposal; the transferable lessons hold whether or not you adopt this specific prototype.

  1. Don’t co-locate secrets with untrusted text. The root cause is credentials sharing a context window with retrieved content. Keep real secrets in a tool-runtime credential map the model never sees, and pass only references.
  2. Add a pre-output signal, not just output filtering. Text-level scanning is evaded by encodings and paraphrase. If you run open-weight models, an activation probe gives you a cheap (~1 ms) check that fires before any output token is emitted.
  3. Use honeytokens, and calibrate them properly. Conformal calibration removes brittle hand-tuned thresholds by setting the threshold from held-out benign data. Because nothing legitimate should ever use a canary, a touched one is high-confidence evidence.
  4. Account for leakage over time. Per-turn detectors miss slow drips. Track a cumulative budget across the whole session — and persist it across sessions so an attacker can’t reset by reconnecting.
  5. Instrument tool-call arguments, not just chat output. The prototype’s biggest blind spot is the place credentials actually flow. Apply the same canary and leakage logic to serialized tool arguments before dispatch.
  6. Treat the numbers as preliminary. White-box probes, small synthetic suites, and 0.99+ AUROC demand replication on your own corpus before you trust them.
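
Items 1 and 5 can be combined at the dispatch boundary: the model emits only credential references, the runtime resolves them from a map the model never sees, and serialized arguments are scanned for honeytokens (including Base64 and hex forms) before the call goes out. The sketch below is illustrative only; the `cred://` reference scheme, the sample secrets, and the function names are hypothetical.

```python
import base64
import json

# Real secrets live only in the tool runtime (hypothetical values).
CREDENTIAL_MAP = {"cred://billing-api": "real-key-do-not-expose"}
# Canary credentials that were planted in the model-visible context.
HONEYTOKENS = ["AKIAHONEYTOKEN000001"]


def canary_variants(token: str) -> list[str]:
    """Plain, Base64, and hex forms of a honeytoken."""
    raw = token.encode()
    return [token, base64.b64encode(raw).decode(), raw.hex()]


def find_canaries(arguments: dict) -> list[str]:
    """Return honeytokens that appear anywhere in the serialized arguments."""
    # Case-insensitive match so upper-cased hex is still caught.
    blob = json.dumps(arguments).lower()
    return [t for t in HONEYTOKENS
            if any(v.lower() in blob for v in canary_variants(t))]


def resolve_references(value):
    """Swap credential references for real secrets, recursing into containers."""
    if isinstance(value, dict):
        return {k: resolve_references(v) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve_references(v) for v in value]
    if isinstance(value, str):
        return CREDENTIAL_MAP.get(value, value)
    return value


def prepare_tool_call(arguments: dict) -> dict:
    """Block calls that carry a canary; otherwise resolve references."""
    hits = find_canaries(arguments)
    if hits:
        raise PermissionError(f"honeytoken in tool arguments: {hits}")
    return resolve_references(arguments)
```

Matching a handful of encodings is still a text-level check of the kind the paper says is evadable, so this is a backstop at the tool layer rather than a replacement for activation or leakage monitoring.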

Status

Item                          | Reference                          | Date       | Notes
------------------------------+------------------------------------+------------+---------------------------------------------
Paper posted                  | arXiv:2606.04141v1 [cs.CR]         | 2026-06-02 | Chauhan & Revankar, CC BY 4.0
CIFT (single-turn, Qwen-7B)   | Table 1                            | 2026-06-02 | AUROC 0.998, F1 0.990, FPR 0.015, pre-output
Nimbus (multi-turn synthetic) | Table 3                            | 2026-06-02 | 0.90 detection vs 0.12 per-turn LlamaGuard
Integrated AIS (single-turn)  | Table 4                            | 2026-06-02 | 0.94 detection, 0.018 FPR, +16 ms
Known blind spot              | Limitations §6                     | 2026-06-02 | Credentials in structured tool-call arguments
Foundational threat           | Greshake et al. (arXiv:2302.12173) | 2023       | Indirect prompt injection origin

The right framing is not “credential exfiltration is solved.” It is that detection should stop being a single text classifier at the output: combine pre-output access monitoring, calibrated canary detection, and temporal leakage accounting — and instrument the tool-call layer where secrets really move.

Sources