system: OPERATIONAL
← back to all hacks
RESEARCH MEDIUM NEW

The cold-start safety gap: agents are least safe at the very first turn

A June 2026 paper finds tool-calling agents are most vulnerable at the start of a session and grow 9–52% safer after a few routine tasks. The fix is a deployment warm-up, not a new guardrail.

2026-06-17 // 6 min affects: llm-agents, tool-calling-agents, gpt-4, claude-3, llama-3

What is this?

A preprint posted to arXiv on June 5, 2026 (2606.07867) reports a counter-intuitive property of tool-calling LLM agents: they are not equally safe throughout a conversation. An agent is at its most vulnerable at the very first turn of a session, and becomes measurably harder to misuse after it has completed a handful of ordinary, benign tasks. The authors — Chung-En Sun, Linbo Liu and Tsui-Wei Weng of the Trustworthy-ML-Lab — call this the cold-start safety gap.

The size of the gap is not marginal. Across 7 models from 4 families, the agent’s refusal of harmful requests improved by 9% to 52% as the number of preceding benign tasks rose from zero to twenty. The same model, with the same system prompt, was much easier to push into harmful tool use when the malicious request arrived first — before any normal work had happened in the session.

How it works

To measure the effect cleanly, the paper introduces a benchmark called SODA (Safety Over Depth for Agents). SODA controls one variable: how many regular agentic tasks the agent completes before it encounters a safety-critical request, supporting depths of up to 20 preceding tasks. By holding the harmful request fixed and only varying the depth, the authors isolate conversation depth as the cause rather than prompt wording or model version.

The mechanism is visible in the model’s internals. A representation analysis shows that the hidden states drift toward a safety-aligned region of activation space as more benign tasks accumulate in context — the model is, in effect, “warming up” into a safer operating mode. The authors then dissect which part of the preceding conversation does the work, and the answer is specific: the regular tasks themselves are the primary driver of the safety gain, while the agent’s own prior responses contribute little to safety but are necessary to preserve later usefulness. Strip the benign tasks and safety collapses back to the cold-start level; strip the agent’s replies and it stays safe but loses capability on subsequent work.

The findings replicate on independent, open benchmarks — AgentHarm and Agent Safety Bench for safety, and BFCL and API-Bank for utility — which is what separates this from a single-setup curiosity. No jailbreak strings are reproduced here; the contribution is diagnostic. It builds on the established line of agent-misuse measurement work such as AgentHarm (2410.09024), which already showed that frontier-model agents are surprisingly compliant with malicious tasks even without jailbreaking.

Why it matters

Most agent safety evaluation is done on a fresh, single-turn session: spin up the agent, send the harmful prompt, record whether it refuses. This paper says that setup measures the agent at its worst-case safety point and then ships it. A red-team sign-off obtained on turn one does not describe the agent’s behavior on turn ten, and — more importantly — an attacker who reaches the agent first, before any legitimate use, is hitting it exactly where it is weakest.

That has direct consequences for how agents are exposed. A freshly spawned agent handed straight to untrusted input — a new session triggered by an inbound email, a webhook, a customer message, or a per-request ephemeral agent that starts cold every time — sits in the cold-start zone by design. The very architectures people adopt for isolation (a brand-new agent per task, no shared history) can maximize the exposure this paper describes.

Defenses

  • Warm up the agent before exposing it to untrusted input. The paper’s headline recommendation: have the agent complete a few routine, benign agentic tasks at session start before it can receive safety-critical requests. This shifts it into the safer representation region while preserving full capability, and needs no retraining.
  • Stop benchmarking safety only at turn one. Treat conversation depth as an explicit evaluation axis. Measure refusal rates at depth 0 and at realistic operating depths, and gate deployment on the cold-start number, since that is what an early attacker faces.
  • Be deliberate about ephemeral, per-request agents. Spawning a fresh cold agent for every inbound request is good for isolation but lands every request in the weakest safety state. If you use this pattern, pair it with a warm-up sequence or stronger external gating on the first turns.
  • Keep safety outside the model for the cold window. Because the gap is largest before any context accumulates, do not rely on model-level refusal alone at session start. Put input/output filtering, tool-permission checks and human approval on the earliest, highest-risk turns.
  • Re-validate after upgrades. The gap’s size varied across the 7 models tested, so the warm-up depth that suffices for one model may not transfer. Re-measure depth-versus-safety on the exact build you deploy.

Status

ItemDetail
Paper”The Cold-Start Safety Gap in LLM Agents”
arXiv ID2606.07867 (cs.CL)
PostedJune 5, 2026
AuthorsChung-En Sun, Linbo Liu, Tsui-Wei Weng (Trustworthy-ML-Lab)
BenchmarkSODA (Safety Over Depth for Agents), up to 20 preceding tasks
Scope7 models, 4 families
Key resultSafety improves 9–52% from 0 → 20 preceding benign tasks
DriverRegular benign tasks (not the agent’s own replies) drive the safety gain
Cross-checksAgentHarm, Agent Safety Bench (safety); BFCL, API-Bank (utility)
NatureDefensive measurement study — code released, no exploit payloads

Sources