DATA LEAK MEDIUM NEW

Reasoning trace exposure: hiding chain-of-thought doesn't protect it

A May 2026 paper shows that prompting alone can pull a reasoning model's hidden chain-of-thought back into user-visible output — and the recovered traces are good enough to distill a smaller model.

2026-06-16 // 7 min // affects: qwen3-14b, qwen3-32b, qwen2.5-7b-instruct

What is this?

Most deployed reasoning models no longer show you their raw chain-of-thought. OpenAI treats the hidden CoT of its reasoning models as an internal monitoring object, Gemini exposes thought summaries rather than raw thoughts, and Claude’s extended thinking offers controlled, not complete, transparency. The stated reasons are safety monitoring and protecting a commercially valuable asset: detailed reasoning traces are exactly what you need to distill a frontier model’s behavior into a cheaper one.

A paper published on arXiv on 30 May 2026, “Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs” (arXiv:2606.00642, Lu et al., National Yang Ming Chiao Tung University and UC Berkeley), asks whether that interface-level hiding actually protects the trace. Its answer is no. Using a lightweight prompting method the authors call Reasoning Exposure Prompting (REP), a user can induce a victim model to emit reasoning in its visible output that closely matches its hidden internal trace, and the recovered text is good enough to train a smaller student model.

This matters because it reframes a control many providers rely on. Hiding the CoT was meant to stop capability extraction; this work shows the trace can leak through ordinary prompting, with no access to weights, logits, or the monitoring channel.

How it works

The intuition is behavioral, not adversarial. A reasoning model that refuses to reveal its hidden steps when asked directly will still happily continue a pattern it has been shown. REP exploits that gap.

At a high level, REP builds a short prefix of question–reasoning–answer demonstrations, wraps that prefix in an auxiliary, code-like format (the authors test markdown fences, shell-style commands, and similar transformations), and prepends it to the real target question. Because the demonstrations present reasoning as part of the user-visible answer, the model treats visible step-by-step reasoning as the expected output shape and produces it for the target too. No payload is reproduced here; the mechanism is few-shot format conditioning, not a secret string.

To check that the exposed text is the model’s own reasoning and not a plausible-looking substitute, the authors track three traces on open-weight models: the benign internal trace under normal prompting, the internal trace under REP, and the visible trace REP produces. They measure structural validity (is it parseable as reasoning-then-answer), exposure fidelity (does the visible trace match the internal one), behavior preservation (does the answer stay the same), and downstream utility (does training on it help a student).
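The first three checks can be sketched on the defender's side, for anyone who wants to measure how far a visible trace drifts from a monitored one. This is a minimal illustration, not the paper's evaluation code: the `<reasoning>…</reasoning>` / `Answer:` output shape, the function names, and the use of `difflib` word-level similarity as a stand-in for the authors' fidelity metric are all assumptions. The article does not say which internal trace the fidelity score is taken against, so the sketch reports both. Downstream utility is left out because it requires actually training a student.

```python
import re
from difflib import SequenceMatcher

# Hypothetical output shape: reasoning inside <reasoning> tags, then "Answer: ...".
# The paper's real parsing rules are not described in this article.
TRACE_RE = re.compile(r"<reasoning>(.*?)</reasoning>\s*Answer:\s*(.+)", re.DOTALL)


def parse_trace(visible_output: str):
    """Return (reasoning, answer) if the output parses as reasoning-then-answer, else None."""
    match = TRACE_RE.search(visible_output)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def similarity(a: str, b: str) -> float:
    """Word-level similarity in [0, 1]; an assumed stand-in for the paper's fidelity metric."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def score_exposure(benign_internal: str, benign_answer: str,
                   rep_internal: str, visible_output: str) -> dict:
    """Compare the three traces described above for a single question."""
    parsed = parse_trace(visible_output)
    if parsed is None:
        # Structural validity failed: nothing else can be scored.
        return {"structurally_valid": False}
    visible_reasoning, visible_answer = parsed
    return {
        "structurally_valid": True,
        # Exposure fidelity, measured against each internal trace separately.
        "fidelity_vs_rep_internal": similarity(visible_reasoning, rep_internal),
        "fidelity_vs_benign_internal": similarity(visible_reasoning, benign_internal),
        # Behavior preservation: the final answer matches the normal-prompting answer.
        "behavior_preserved": visible_answer == benign_answer,
    }
```

Exact string equality is a crude test for behavior preservation; a real harness would normalize answers or use a task-specific grader.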

In their experiments — OpenThoughts-114k as the source set, Qwen3-14B and Qwen3-32B as victims, Qwen3-14B as a shadow model, and Qwen2.5-7B-Instruct as the student — the best configuration was a markdown-fence prefix with three demonstrations. Averaged across benchmarks it produced student gains that beat answer-only supervision by 2.09×, beat training on summarized traces by 1.25×, and beat the trace-inversion baseline from “How to Steal Reasoning Without Reasoning Traces” (arXiv:2603.07267, March 2026) by 1.23×, reaching 96.7% of an oracle that uses the real internal trace. In other words, the exposed traces carry transferable reasoning signal, not just style.

Why it matters

The first consequence is for model-IP protection. Hiding raw CoT is now a documented part of how frontier labs respond to distillation and model-extraction attempts. REP, alongside the March 2026 trace-inversion work, is a second independent demonstration that hidden weights plus a hidden trace are not enough: if a user can interact with the model, they can reconstruct training-grade reasoning. Anyone whose threat model assumed “we don’t show the CoT, so it can’t be copied” should revisit that assumption.

The second consequence is for safety monitoring. The CoT-monitorability position paper signed by 40+ researchers across OpenAI, Anthropic and Google DeepMind (arXiv:2507.11473, July 2025) argues that readable chains of thought are a fragile but valuable safety signal — and warns that pressure on the CoT can make it diverge from the model’s true reasoning. REP adds a wrinkle: the visible trace a user can elicit may not be the same object the provider monitors internally, so reasoning that looks benign in one channel is not a guarantee about the other.

The third is scope. The experiments are on open-weight Qwen3 models, so the precise numbers don’t transfer automatically to closed systems. But the method needs no privileged access, and the deployed systems it targets conceptually — hidden-CoT reasoning models behind an API — are exactly the high-value ones.

Defenses

The paper is candid that this is hard to stop cleanly, and its own findings rule out the easy options.

  1. Don’t rely on deterministic string/format blocks. Blocking a specific delimiter, wrapper, or fence stops one REP variant; the authors note minor format changes preserve exposure. Pattern blocklists are brittle by construction here.

  2. Don’t rely on refusal training alone. Refusal-oriented defenses are insufficient because jailbreak-style prompting can suppress the refusal while REP still provides a format-conditioned path to reconstruct reasoning. Treat “the model declines to show its CoT” as a weak control, not a boundary.

  3. Govern at the distillation layer, not just the trace. Because the leak is the reasoning signal rather than a literal copy of the hidden trace, the durable defenses are the ones aimed at extraction: per-account rate and volume limits, anomaly detection on access patterns consistent with dataset harvesting, output-similarity and canary monitoring, and the terms-of-service / legal track providers already use against distillation campaigns.

  4. Re-cost the “hidden CoT” control in your threat model. If you operate a reasoning model, score hidden-CoT as raising the attacker’s cost, not as protecting the trace. If you consume one, don’t assume a provider’s hidden reasoning is unrecoverable when designing systems that depend on that secrecy.

  5. Keep a faithful internal monitor. Per the monitorability paper, preserve a CoT channel you actually trust for safety review, and account for the possibility that a user-elicited visible trace diverges from it.
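The access-pattern side of defense 3 might look something like the sketch below: a per-account sliding window that flags raw volume and, separately, a high share of prompts reusing one long prefix, which is what templated dataset harvesting tends to look like. The class name `HarvestMonitor`, its parameters, and every threshold are hypothetical and chosen only for illustration. The shared-prefix signal is not a format blocklist, but it is still evadable by varying the prefix, so it should feed review and rate decisions rather than act as a boundary.

```python
import time
from collections import defaultdict, deque
from os.path import commonprefix  # character-wise common prefix of a list of strings


class HarvestMonitor:
    """Sliding-window volume limit plus a shared-prefix signal, tracked per account."""

    def __init__(self, window_s=3600.0, max_requests=500,
                 min_prefix_chars=200, prefix_share=0.8, min_events=10):
        # All thresholds are illustrative placeholders, not recommended values.
        self.window_s = window_s
        self.max_requests = max_requests
        self.min_prefix_chars = min_prefix_chars
        self.prefix_share = prefix_share
        self.min_events = min_events
        self._events = defaultdict(deque)  # account_id -> deque of (timestamp, prompt)

    def record(self, account_id, prompt, now=None):
        """Record one request and return the list of flags raised for this account."""
        now = time.time() if now is None else now
        events = self._events[account_id]
        events.append((now, prompt))
        # Drop events that have fallen out of the window.
        while events and now - events[0][0] > self.window_s:
            events.popleft()

        flags = []
        if len(events) > self.max_requests:
            flags.append("volume_limit")
        if (len(events) >= self.min_events
                and self._templated_share(events, prompt) >= self.prefix_share):
            flags.append("templated_prompts")
        return flags

    def _templated_share(self, events, prompt):
        """Fraction of windowed prompts sharing a long common prefix with the latest one."""
        shared = sum(
            1 for _, other in events
            if len(commonprefix([prompt, other])) >= self.min_prefix_chars
        )
        return shared / len(events)
```

A caller would invoke `monitor.record(account_id, prompt)` on each request and route any returned flags to throttling or manual review. The prefix comparison is linear in the window size per request, which is acceptable for a sketch but would need indexing or sampling at production volumes.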

Status

| Item | Reference | Date | Notes |
| --- | --- | --- | --- |
| REP / “Hidden Thoughts Are Not Secret” | arXiv:2606.00642 | 2026-05-30 | Prompting recovers hidden reasoning; 96.7% of oracle internal-trace utility |
| Reasoning trace inversion (“How to Steal Reasoning…”) | arXiv:2603.07267 | 2026-03-07 | Reconstructs traces from inputs/answers/summaries; REP baseline |
| CoT Monitorability (40+ authors, OpenAI/Anthropic/DeepMind) | arXiv:2507.11473 | 2025-07-15 | CoT as fragile safety signal; faithfulness can degrade under pressure |
| Empirical scope | arXiv:2606.00642 | 2026-05-30 | Victims Qwen3-14B/32B, student Qwen2.5-7B-Instruct; open-weight |

The headline is not “a new jailbreak.” It is that an architectural privacy assumption — hide the chain-of-thought and it stays hidden — does not hold against ordinary prompting, for either the IP-protection or the safety-monitoring use that motivated hiding it in the first place.

Sources