AGENTS MEDIUM NEW

Stop fixating on the prompt: hijacking an agent's reasoning and memory

An April 2026 paper, JailAgent, drives an agent to malicious tool calls without touching the user prompt — by perturbing its reasoning trace and memory retrieval instead. The prompt was never the whole attack surface.

2026-06-02 // 6 min // affects: llm-agents, tool-using-agents, reasoning-models, memory-augmented-agents

What is this?

Most prompt-injection defenses rest on one assumption: the danger arrives in the input. Tag the user turn, the retrieved document, the tool output; decide which spans are “instructions” and which are “data”; filter the bad ones. Two recent red-teaming papers argue that this framing misses where modern agents actually decide what to do.

On April 7, 2026, Yanxu Mao, Peipei Liu, Tiehan Cui and co-authors posted Stop Fixating on Prompts: Reasoning Hijacking and Constraint Tightening for Red-Teaming LLM Agents (arXiv:2604.05549). Their framework, JailAgent, induces an agent to perform malicious actions without modifying the user prompt at all. It works on the layer below the prompt: the agent’s reasoning trajectory and its memory retrieval.

JailAgent is the successor to UDora (arXiv:2503.01908, first posted February 28, 2025, last revised November 12, 2025; Jiawei Zhang, Shuang Yang, Bo Li), which introduced the core idea: an LLM agent “extensively reasons or plans before executing final actions,” so the reasoning trace itself is a place an attacker can steer. Together they make a single point for defenders — the chain-of-thought and the memory store are part of the attack surface, not neutral internals.

How it works

This section describes the shape of the technique, not a runnable attack. No payloads, trigger strings, or optimization code are reproduced here; readers who want the method should go to the papers.

UDora’s mechanism, per its abstract, is a loop:

1. Run the agent on the task and capture its reasoning trace.
2. Identify points inside that trace where a small perturbation
   would tip the agent toward a target (malicious) action.
3. Use the perturbed reasoning as a surrogate target and optimize.
4. Iterate until the agent calls the chosen tool / takes the action.

The agent is never told “ignore your instructions.” Instead, its own intermediate reasoning is nudged so the harmful tool call looks like the natural next step the model was already heading toward.

JailAgent (April 2026) generalizes this and removes the dependence on prompt edits entirely. Its three stages, as described by the authors, are:

  • Trigger Extraction — locate the specific cues in context or memory that the agent keys on when it decides to act.
  • Reasoning Hijacking — adaptively steer the reasoning trajectory toward the attacker’s goal in real time, rather than with a fixed template.
  • Constraint Tightening — narrow the agent’s option space with an optimized objective so the unsafe action becomes the path of least resistance.

The authors report that this transfers across models and scenarios. The mechanism matters more than any single number: because the manipulation lives in the reasoning and memory path, a guardrail that only inspects the prompt can pass the input as clean and still watch the agent walk into the action.

Why it matters

Two practical consequences follow.

First, input-side classifiers are not a complete control. A defense that scores the user message and retrieved text for “instruction-like” content can return a clean verdict while the compromise happens during planning. This lines up with recent theoretical results (see the write-ups on contextual integrity and on “data flow is not authority”), which argue that data-versus-instruction separation cannot be the whole answer.

Second, memory is a live attack surface. When an agent retrieves prior “experience” to plan, poisoned or attacker-shaped memory entries become triggers, echoing what memory-based tool hijacking attacks showed from the poisoning side. An agent with long-term memory carries its own future triggers.
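To make the memory point concrete, here is a minimal defensive sketch of keeping provenance and an expiry on each memory entry before it reaches the planner. It is not taken from either paper. The names `MemoryEntry`, `TRUSTED_SOURCES`, `usable_for_planning`, and `retrieve_for_planner`, along with the 30-day window, are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical provenance labels; a real deployment would define its own.
TRUSTED_SOURCES = {"operator", "verified_tool"}


@dataclass(frozen=True)
class MemoryEntry:
    text: str
    source: str           # who or what wrote this entry
    created_at: datetime  # timezone-aware write time


def usable_for_planning(entry: MemoryEntry, now: datetime,
                        max_age: timedelta = timedelta(days=30)) -> bool:
    """True only for entries with trusted provenance that have not expired."""
    if entry.source not in TRUSTED_SOURCES:
        return False
    return now - entry.created_at <= max_age


def retrieve_for_planner(entries, now):
    """Split retrieved memory into planner-visible and quarantined entries."""
    allowed, quarantined = [], []
    for entry in entries:
        (allowed if usable_for_planning(entry, now) else quarantined).append(entry)
    return allowed, quarantined


if __name__ == "__main__":
    now = datetime(2026, 6, 2, tzinfo=timezone.utc)
    entries = [
        MemoryEntry("Preferred report format is PDF", "operator", now - timedelta(days=3)),
        MemoryEntry("Text copied from a fetched web page", "web_page", now - timedelta(days=1)),
        MemoryEntry("Old deployment note", "operator", now - timedelta(days=90)),
    ]
    allowed, quarantined = retrieve_for_planner(entries, now)
    print(len(allowed), "allowed;", len(quarantined), "quarantined")
```

The point of the design is that the filter runs on metadata the attacker cannot write, not on the entry text. An entry whose wording looks harmless is still quarantined if its source is untrusted or it has aged out.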

The risk concentrates exactly where agents are most useful: tool-rich deployments with real side effects — sending mail, moving money, running code, calling internal APIs.

Defenses

You cannot patch “the model reasons before acting.” You can make the reasoning matter less to safety.

  1. Gate on the action, not the intent. Validate the final tool call against policy regardless of how the reasoning arrived there. A transfer, a delete, or an outbound request should pass an independent check that never reads the chain-of-thought as authority. This is the core of both the “rule of two” and the “lethal trifecta” framings.
  2. Least privilege on tools. Scope each tool to the minimum permissions it needs, and require explicit human approval for high-impact actions. Hijacked reasoning still hits a wall if the agent simply cannot invoke the dangerous capability unattended.
  3. Treat memory as untrusted input. Validate, attribute and segment retrieved memory the same way you would an external document. Keep provenance on every memory entry and expire aggressively.
  4. Isolate planning from untrusted content. Don’t let the same context that holds attacker-controllable data also drive tool selection. Dual-LLM / plan-then-execute patterns keep the privileged planner away from raw untrusted text.
  5. Monitor the trace, with humility. Anomaly checks on reasoning and tool-selection can catch crude steering, but the contextual-integrity impossibility result warns that trace inspection alone will not close the gap. Use it as detection-in-depth, not a sole control.
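Items 1 and 2 above can be sketched as a small action gate. This is an illustrative sketch under stated assumptions, not a production policy engine: the `ToolCall` type, the `POLICY` table, the tool names, and the three verdict strings are all hypothetical. The property that matters is that `gate` takes only the proposed tool call as input and never sees the reasoning trace.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: dict = field(default_factory=dict)


# Hypothetical policy table. Tools not listed here are denied by default.
POLICY = {
    "search_docs":    {"allowed": True, "needs_approval": False},
    "send_email":     {"allowed": True, "needs_approval": True},
    "transfer_funds": {"allowed": True, "needs_approval": True},
}


def gate(call: ToolCall, approved_by_human: bool = False) -> str:
    """Decide on a tool call from the call alone; reasoning is not an input."""
    rule = POLICY.get(call.tool)
    if rule is None or not rule["allowed"]:
        return "deny"  # unknown or disabled tool: least-privilege default
    if rule["needs_approval"] and not approved_by_human:
        return "hold_for_approval"  # high-impact action waits for a person
    return "allow"


if __name__ == "__main__":
    print(gate(ToolCall("search_docs", {"query": "quarterly report"})))
    print(gate(ToolCall("transfer_funds", {"amount": 100})))
    print(gate(ToolCall("delete_repo")))
```

Because the verdict depends only on the tool name and the approval flag, a hijacked reasoning trace cannot argue its way past the check. A fuller version would also validate `args` (recipients, amounts, target hosts) against per-tool limits.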

Status

| Item | Reference | Date | Notes |
|---|---|---|---|
| UDora (reasoning-trace hijacking) | arXiv:2503.01908 | 2025-02-28 → 2025-11-12 (v3) | Inserts perturbations into the agent’s own reasoning; code public |
| JailAgent / Stop Fixating on Prompts | arXiv:2604.05549 | 2026-04-07 | No prompt edits; Trigger Extraction → Reasoning Hijacking → Constraint Tightening |
| Theoretical backdrop | Contextual Integrity, agent surveys | 2026 | Data-vs-instruction separation has hard limits |

The headline is not “a new jailbreak.” It is a relocation of the problem: for tool-using agents, the prompt is one entrance among several, and the reasoning trace and memory are the ones your input filter never sees.

Sources