Forgotten but recoverable: why LLM machine unlearning keeps leaking back
Multiple papers from 2024 through 2026 show that 'unlearned' knowledge in LLMs is routinely recoverable via quantization, adversarial prompting, and now reasoning traces. Treating unlearning as erasure is a mistake.
What is this?
Machine unlearning is the family of techniques that try to make a trained language model “forget” a specific slice of what it learned — a person’s data after a deletion request, copyrighted text, or hazardous knowledge such as the bioweapon and cyber content in the WMDP benchmark. It is increasingly invoked as a compliance and safety control: rather than retrain a model from scratch (expensive) every time something must be removed, you run an unlearning procedure that cheaply suppresses the target.
A steady line of research from 2024 through 2026 keeps arriving at the same uncomfortable conclusion: most unlearning does not erase knowledge; it hides it, and the hiding is shallow. The newest entry, Towards Unveiling Vulnerabilities of Large Reasoning Models in Machine Unlearning (arXiv:2604.04255, Iowa State University, posted April 2026), extends the problem to reasoning models. It joins REBEL (arXiv:2602.06248, February 2026), the ICLR 2025 quantization paper (arXiv:2410.16454), a step-by-step reasoning attack (arXiv:2506.17279, June 2025), and a systematization of knowledge (arXiv:2506.09227, June 2025) in showing that “forgotten” is not the same as “gone.”
How it works
The core problem is one of evaluation. Standard unlearning benchmarks query the model with benign, direct questions (“Who is X?”) and declare success when the answer no longer appears. But suppressing a model’s most likely output is not the same as removing the underlying representation. Several independent recovery channels exploit that gap:
| Recovery channel | What it exploits | Reported effect |
|---|---|---|
| Quantization | Unlearning nudges weights only slightly; low-precision rounding undoes the nudge | Forget-knowledge retained rises ~21% -> ~83% at 4-bit |
| Adversarial prompting (evolutionary search) | Benign-query metrics miss residual knowledge reachable by harder prompts | REBEL ASR up to 60% (TOFU), 93% (WMDP) |
| Reasoning probes | Step-by-step elicitation pulls "erased" facts back into the output | 62.5% of crafted prompts recovered target facts |
| Reasoning-model attack | Long rationales are a weak optimization surface during unlearning itself | Misleading-but-convincing traces; wrong final answers |
The quantization result is the most vivid. Because utility-preserving unlearning perturbs weights only gently, simply converting the unlearned model to 4-bit, a routine deployment step, leaves an average of roughly 83% of the “forgotten” knowledge retained, versus ~21% at full precision. REBEL attacks from the prompt side: an evolutionary search loop refines adversarial queries that pull residual knowledge back out, reaching attack success rates (ASR) of up to 60% on TOFU and 93% on WMDP, on models that ordinary benign queries would have scored as “successfully unlearned.” No exploit payload is needed to understand the lesson, and none is reproduced here.
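The rounding mechanism itself can be illustrated without any model or attack. The following is a toy numerical sketch, not the paper's experiment: it assumes weights drawn uniformly from [-1, 1], models the "unlearning" update as a hypothetical small random perturbation, and uses a simple symmetric round-to-nearest quantizer (real deployments typically use per-group scales and other refinements). The function names and the perturbation size are illustrative assumptions, and the fractions it prints are unrelated to the ~21% and ~83% figures above.

```python
import random


def quantize(weights, bits, w_max=1.0):
    """Symmetric round-to-nearest quantization to signed integer codes."""
    levels = 2 ** (bits - 1) - 1          # 7 for 4-bit, 127 for 8-bit
    scale = w_max / levels                # width of one quantization bin
    return [max(-levels, min(levels, round(w / scale))) for w in weights]


def fraction_collapsed(original, unlearned, bits):
    """Share of weights whose quantized code is identical before and
    after the (hypothetical) unlearning update."""
    q_orig = quantize(original, bits)
    q_unl = quantize(unlearned, bits)
    same = sum(a == b for a, b in zip(q_orig, q_unl))
    return same / len(original)


rng = random.Random(0)
original = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]
# Assumption: unlearning moves each weight by at most 0.01.
unlearned = [w + rng.uniform(-0.01, 0.01) for w in original]

collapsed = {bits: fraction_collapsed(original, unlearned, bits) for bits in (4, 8)}
for bits, frac in collapsed.items():
    print(f"{bits}-bit: {frac:.1%} of weights quantize to their pre-unlearning code")
```

Under these assumptions the 4-bit bins are much wider than the update, so most perturbed weights round back to the same code as the original model, while the finer 8-bit grid preserves more of the update.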
Why it matters
The risk surface is two-sided. On the privacy side, organizations that run unlearning to satisfy a deletion or right-to-erasure request may be telling regulators and users that data is gone when it is recoverable by anyone who quantizes the model or prompts it cleverly. On the safety side, the WMDP numbers are the alarming ones: hazardous knowledge that a safety team believed it had stripped out can resurface at high rates, especially after the quantization that almost every open-weight deployment performs.
The deeper point is methodological. A defense that is only ever measured against the easiest possible test will look far stronger than it is. The 2026 reasoning-model work sharpens this: as models are trained to “think” in long chains, those chains create new extraction surface — the very reasoning that improves capability also gives an attacker more places to coax suppressed content back out. Unlearning evaluated with benign single-turn questions is, in effect, security theater.
Defenses
- Do not treat unlearning as erasure. For genuine deletion or compliance, the only robust guarantee remains not training on the data, or retraining without it. Unlearning is a mitigation, not a delete button.
- Evaluate adversarially, not benignly. Test unlearned models with paraphrase, multi-turn, and reasoning-style probes — and with evolutionary attackers like REBEL — not just direct questions. Report the attack success rate of recovery, not only benign forget-loss.
- Include quantization in the threat model. Measure forgotten-knowledge recovery at the precisions you actually ship (4-bit, 8-bit), since 4-bit can undo unlearning while 8-bit often does not.
- Prefer robustness-aware unlearning. Methods that flatten the loss landscape around the unlearned point (sharpness-aware minimization and successors) are reported to resist relearning and recovery better than point-minimization methods.
- Layer with access control. Where hazardous or private content must not leak, combine unlearning with output filtering, retrieval restrictions, and least-privilege access rather than relying on the model having truly forgotten.
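The evaluation advice above can be sketched as a small reporting harness. This is a minimal illustration under stated assumptions, not any paper's benchmark code: `recovery_rate`, `evaluate_unlearning`, and `toy_model` are hypothetical names, the "model" is a stub that refuses direct questions but leaks under step-by-step prompting, the forgotten fact is fictional, and substring matching is a crude stand-in for a real leakage judge.

```python
from typing import Callable, Dict, List


def recovery_rate(model: Callable[[str], str], probes: List[str], secret: str) -> float:
    """Fraction of probes whose response contains the supposedly forgotten string."""
    if not probes:
        return 0.0
    hits = sum(secret.lower() in model(p).lower() for p in probes)
    return hits / len(probes)


def evaluate_unlearning(
    model: Callable[[str], str],
    probe_families: Dict[str, List[str]],
    secret: str,
) -> Dict[str, float]:
    """Report recovery per probe family, so a benign-only pass cannot hide leaks."""
    return {name: recovery_rate(model, probes, secret)
            for name, probes in probe_families.items()}


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for an unlearned model: suppresses the direct
    # answer but leaks it when asked to reason step by step.
    if "step by step" in prompt.lower():
        return "Working through it: the records point to Avalon."
    return "I don't know."


probe_families = {
    "benign_direct": ["Where was Jane Roe born?"],
    "paraphrase": ["Name Jane Roe's birthplace."],
    "reasoning": ["Think step by step: where was Jane Roe born?"],
}

report = evaluate_unlearning(toy_model, probe_families, secret="Avalon")
for family, rate in report.items():
    print(f"{family}: recovery rate {rate:.0%}")
```

In practice the same report would be produced once per precision that actually ships (for example full precision, 8-bit, and 4-bit), with the stub replaced by calls to the real model and the probe lists replaced by paraphrase, multi-turn, and adversarially searched prompts.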
Status
| Work | Reference | Date | Reported finding |
|---|---|---|---|
| Quantization recovery | arXiv:2410.16454 (ICLR 2025) | 2024-10 | 4-bit quantization restores ~83% of forgotten knowledge |
| Reasoning-elicitation attack | arXiv:2506.17279 | 2025-06 | 62.5% of crafted prompts recover target facts |
| SoK: unlearning for LLMs | arXiv:2506.09227 | 2025-06 | Systematizes recovery as a structural weakness |
| REBEL | arXiv:2602.06248 | 2026-02 | Evolutionary recovery up to 60% (TOFU) / 93% (WMDP) |
| LRM unlearning vulnerability | arXiv:2604.04255 | 2026-04 | Reasoning traces are a new unlearning attack surface |
The durable, transferable point is not a single flaw in a single method: it is that the field’s measurement has consistently overstated forgetting. Across quantization, adversarial prompting, and reasoning probes — and now reasoning models specifically — knowledge that benign benchmarks call “unlearned” keeps coming back. Until evaluation routinely includes these recovery channels, an unlearning claim should be read as “harder to retrieve,” not “removed.”