SUPPLY CHAIN MEDIUM NEW

Sequential data poisoning: splitting a backdoor across post-training stages

A June 3, 2026 paper shows that poison spread across SFT and preference data — negligible at each stage alone — combines into a working backdoor. Per-stage audits create a 'single-attacker illusion'.

2026-06-08 // 6 min // affects: sft-pipelines, rlhf-ppo, dpo-alignment, open-weight-llms

What is this?

On June 3, 2026, researchers from the University of Waterloo, the University of Ottawa, the University of Chicago and the Vector Institute posted Sequential Data Poisoning in LLM Post-Training (arXiv:2606.04929). The paper studies a threat model that most poisoning research has ignored: modern alignment is not one training job but a pipeline of stages — supervised fine-tuning (SFT) followed by reinforcement learning from human feedback (RLHF) via PPO, or direct preference optimization (DPO). Each stage draws data from a different, potentially untrusted source.

The central result is what the authors call the “single-attacker illusion”: a poison contribution that looks harmless when each stage is audited in isolation can combine, across stages, into a reliable backdoor. Defenders who screen the SFT set and the preference set separately — the normal practice — can each conclude “this looks clean” and still ship a compromised model.

How it works

The setup distributes a backdoor across the post-training pipeline rather than concentrating it in one dataset. Two (or more) adversaries each contribute poison to a different stage. The paper reports two regimes:

Pipeline    Stage 1 (SFT)    Stage 2 (RLHF/DPO)             Single-stage effect                   Combined effect
----------  ---------------  -----------------------------  ------------------------------------  ------------------------------------------
SFT -> DPO  poison SFT data  poison preference data         each raises ASR a bit on its own      additive; split budget beats concentrating
SFT -> PPO  poison SFT data  poison reward-model (RM) data  near-zero ASR alone for either stage  backdoor surfaces only in combination

In the SFT → DPO case the contributions are roughly additive: each stage’s poison raises the attack success rate (ASR) somewhat, and the paper finds that splitting a fixed poison budget across both stages outperforms spending it all in either stage alone. The SFT → PPO case is sharper and more concerning: neither the SFT poison nor the reward-model poison produces a meaningful ASR by itself, yet their combination surfaces the backdoor. The malicious behavior is essentially invisible at the level of any one dataset and only emerges from the interaction between stages.
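One way to make the additive-versus-emergent distinction concrete is a 2x2 ablation: measure ASR with neither stage poisoned, with each stage poisoned alone, and with both. The sketch below is an illustration only. The `AblationASR` type, the `interaction` score, the tolerance values, and every number in the examples are assumptions made up for this write-up, not metrics or results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AblationASR:
    """Attack success rates (0.0 to 1.0) from a hypothetical 2x2 ablation."""
    clean: float        # neither stage poisoned
    stage1_only: float  # only the SFT data poisoned
    stage2_only: float  # only the preference / reward-model data poisoned
    both: float         # both stages poisoned


def interaction(asr: AblationASR) -> float:
    """Combined gain minus the sum of the two single-stage gains."""
    gain1 = asr.stage1_only - asr.clean
    gain2 = asr.stage2_only - asr.clean
    gain_both = asr.both - asr.clean
    return gain_both - (gain1 + gain2)


def describe(asr: AblationASR, alone_tol: float = 0.05,
             inter_tol: float = 0.05) -> str:
    """Label an ablation; both tolerances are arbitrary illustrative cutoffs."""
    quiet_alone = max(asr.stage1_only, asr.stage2_only) - asr.clean <= alone_tol
    if quiet_alone and asr.both - asr.clean > alone_tol:
        return "emergent"          # invisible per stage, visible combined
    if abs(interaction(asr)) <= inter_tol:
        return "roughly additive"  # gains stack with little interaction
    return "mixed"


# Made-up numbers shaped like the two regimes described above.
dpo_like = AblationASR(clean=0.01, stage1_only=0.21, stage2_only=0.26, both=0.45)
ppo_like = AblationASR(clean=0.01, stage1_only=0.02, stage2_only=0.03, both=0.60)

print(describe(dpo_like))  # roughly additive
print(describe(ppo_like))  # emergent
```

The point of the `emergent` label is that a per-stage audit only ever sees the `stage1_only` or `stage2_only` cell, which is exactly the cell where nothing shows up.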

No trigger strings or poisoning recipes are reproduced here; the canonical reference is the paper itself. The takeaway is structural: the security boundary you care about is the whole post-training pipeline, not any single dataset within it.

Why it matters

The result reframes a defensive assumption. Prior poisoning work — including Anthropic’s 2025 finding that a small, near-constant number of samples can poison models of any size and the follow-up near-constant poison-count analysis — already showed that the absolute poison budget needed is alarmingly low. Sequential poisoning adds a second axis: the budget can be fragmented across procurement boundaries so no single audit ever sees enough to raise an alarm.

This maps cleanly onto how alignment data is actually sourced in 2026. SFT instruction data, human preference labels, and reward-model training data frequently come from different vendors, crowdsourcing platforms, scraped corpora, or synthetic-generation pipelines — different teams, different trust assumptions, different review at each step. A supplier who can influence only the preference set, and another who can influence only the SFT mix, individually pass review. The composition is where the risk lives, and composition is exactly what stage-by-stage data governance does not test.

Defenses

There is no patch here — this is a class of risk in how alignment pipelines are assembled. The mitigations are about provenance and end-to-end evaluation.

  1. Evaluate the pipeline end-to-end, not stage-by-stage. The core lesson of the paper is that per-stage dataset audits miss interaction effects. Run backdoor and trigger evaluations on the final post-trained model against a held-out, independently constructed probe set — and treat a clean SFT audit and a clean preference audit as necessary but not sufficient.

  2. Track data provenance across every stage. Maintain a bill of materials for SFT data, preference data, and reward-model data: source, vendor, collection method, and review status. Sequential poisoning exploits the fact that these are usually governed independently. Cross-referencing suppliers across stages lets you flag when the same upstream actor touches more than one stage.

  3. Diversify and isolate suppliers per stage. If one vendor provides both your SFT corpus and your preference labels, a single compromised supplier holds both halves of the attack. Separating suppliers — and limiting any one source’s share within a stage — raises the bar for cross-stage collusion.

  4. Hold out trusted, in-house evaluation data. Keep a poison-free, internally curated benchmark of trigger-style and behavioral probes that never enters any training set. Re-run it after each major post-training change. The PPO result shows backdoors that only appear post-composition, so the gate must be after the last stage.

  5. Prefer auditable preference and reward pipelines. RLHF reward-model data and DPO preference pairs are harder to inspect than SFT examples, yet the paper shows they are load-bearing for the attack. Sample, log, and spot-check preference and RM data with the same rigor applied to instruction data.

  6. Red-team the composition explicitly. Add “split-budget” poisoning to your internal red-team playbook: assume an adversary can touch one stage only, and test whether two such limited adversaries combine into something your per-stage screening would clear.
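The provenance and supplier checks in items 2 and 3 can be sketched as a small bill-of-materials pass. This is a minimal illustration under stated assumptions: the `DataSource` fields, the stage names, the vendor names, and the `max_share` threshold are all hypothetical, and a real pipeline would pull these records from its own data catalog.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    """One bill-of-materials entry (hypothetical schema)."""
    stage: str          # e.g. "sft", "preference", "reward_model"
    supplier: str       # upstream vendor or actor
    examples: int       # number of examples contributed
    review_status: str  # outcome of the per-stage review


def suppliers_spanning_stages(bom):
    """Map each supplier that touches more than one stage to its stages."""
    stages = defaultdict(set)
    for src in bom:
        stages[src.supplier].add(src.stage)
    return {s: sorted(st) for s, st in stages.items() if len(st) > 1}


def oversized_shares(bom, max_share=0.25):
    """List (stage, supplier, share) where a supplier exceeds max_share."""
    totals = defaultdict(int)
    per_supplier = defaultdict(int)
    for src in bom:
        totals[src.stage] += src.examples
        per_supplier[(src.stage, src.supplier)] += src.examples
    flagged = []
    for (stage, supplier), n in per_supplier.items():
        if totals[stage] == 0:
            continue  # avoid dividing by zero for empty stages
        share = n / totals[stage]
        if share > max_share:
            flagged.append((stage, supplier, share))
    return sorted(flagged)


# Made-up bill of materials; every entry passed its own per-stage review.
bom = [
    DataSource("sft", "vendor_a", 8000, "passed"),
    DataSource("sft", "vendor_b", 2000, "passed"),
    DataSource("preference", "vendor_a", 1000, "passed"),
    DataSource("preference", "vendor_c", 4000, "passed"),
    DataSource("reward_model", "vendor_d", 5000, "passed"),
]

print(suppliers_spanning_stages(bom))  # {'vendor_a': ['preference', 'sft']}
for stage, supplier, share in oversized_shares(bom, max_share=0.5):
    print(stage, supplier, share)
```

Neither check detects poison. Both only surface where a single upstream actor has cross-stage reach or a dominant share, which is where the end-to-end evaluation in items 1 and 4 deserves the most attention.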

Status

Item                                            Reference           Date        Notes
----------------------------------------------  ------------------  ----------  -------------------------------------------------------------------------
Sequential Data Poisoning in LLM Post-Training  arXiv:2606.04929    2026-06-03  Introduces the “single-attacker illusion” threat model
SFT → DPO regime                                Same paper          2026-06-03  Additive; splitting a fixed budget beats concentrating it
SFT → PPO regime                                Same paper          2026-06-03  Neither stage alone is significant; backdoor surfaces only in combination
Near-constant poison count                      arXiv:2510.07192    2025-10     Context: absolute poison budget needed is low
Small-sample poisoning                          Anthropic Research  2025-10     Context: a small number of samples can poison models of any size

The framing to take away is not “another data-poisoning paper”. It is that the unit of audit for a poisoned model has to be the whole post-training pipeline, because an adversary can keep each individual contribution below the threshold any single review would catch — and let the stages do the rest.

Sources