The lethal trifecta is now the default — defend agents at runtime
The lethal trifecta once flagged risky agents. By mid-2026 it describes every useful one, so architecture-level avoidance no longer works. Defense shifts to five runtime behavioral signals.
What is this?
In June 2025, Simon Willison named the “lethal trifecta”: an agent that simultaneously has access to private data, exposure to untrusted content, and the ability to communicate externally is a near-guaranteed exfiltration path through indirect prompt injection. We covered the pattern in The Lethal Trifecta. A year on, a June 15, 2026 CSO feature by researcher Ax Sharma makes a sharp follow-on argument: the trifecta has stopped being a signal of elevated risk because it now describes the baseline operation of essentially every agent people actually deploy. When a red flag is present in 100% of deployments, it no longer distinguishes anything. The defensive question therefore moves from “does my agent have the trifecta?” to “how do I tell a compromised trifecta-agent from a healthy one?” — a runtime problem, not an architecture one.
How it works
The reasoning is straightforward. A support agent reads customer records (private data), ingests user messages and attachments (untrusted content), and calls CRMs or refund APIs (external communication). An inbox assistant reads your mail, processes messages from strangers, and sends replies. Strip any leg and the agent becomes, in Sharma’s phrasing, “closer to a search box than an agent.” Sophos CISO Ross McKerchar made the same point in a May 2026 post, calling the trifecta “the architectural cost of usefulness.” Meta’s Agents Rule of Two — which we covered in Agents Rule of Two — tries to cap agents at two of the three properties per session, but Meta’s own limitations note concedes many wanted use cases will not fit, and that conforming designs “can still be prone to failure.”
The empirical case is already in. Per Breached.Company’s report, between January 7 and 15, 2026 four production assistants — IBM Bob, Superhuman AI, Notion AI, and Anthropic’s Claude Cowork — were each shown to leak data via indirect prompt injection. In the Cowork case, a hidden instruction in an uploaded document steered the agent to exfiltrate files through an allowlisted API domain — invisible to perimeter controls and indistinguishable from normal behavior until the data was gone.
Why it matters
If the trifecta is now table stakes, perimeter and architecture checks alone cannot catch the compromise, because nothing structural separates the malicious action from the legitimate one. A compromised agent does not behave abnormally — it follows instructions, which is its job. What changes is whose instructions, and that only becomes visible at the level of the agent’s actual runtime actions. This reframing matters for anyone scoping detection: budget belongs in agent observability and behavioral telemetry, not only in pre-deployment design review.
Defenses
The CSO feature distills detection into five runtime signals. Treat them as the agent equivalent of EDR/SIEM telemetry — instrumentation most deployments still lack:
- Instruction-following anomalies. Flag actions with no plausible link to the user’s task — e.g., a “summarize this report” request that triggers an outbound request to an unfamiliar domain. The content it ingested told it to.
- Tool-call sequences that break expected topology. A bug-fixing coding agent should touch files, tests, and docs — not reach for email or calendar APIs. Flag cross-workflow tool calls even when each call looks legitimate alone. See runtime tool-call interception.
- Exfiltration via low-bandwidth channels. Encoded image URLs, data tucked into API parameters, links in generated documents. Detection requires correlating what data the agent could access against what it embedded in output — end-to-end action visibility, not just the final response. Related: silent egress.
- Credential access outside task scope. An agent fixing a rendering bug has no reason to read cloud credentials. Least privilege is the architectural control; monitoring out-of-scope secret access is the detection layer that catches its failures.
- Memory-write anomalies. Persistent memory lets a poisoned entry carry dormant trigger instructions across sessions. Audit memory writes that contain instruction-like content, or that occur in sessions which ingested untrusted data. See agent memory poisoning.
None of these replaces least-privilege scoping or human-in-the-loop approval for high-stakes actions — they are the detection layer that assumes those controls will sometimes fail.
Status
| Item | Detail |
|---|---|
| Concept | Lethal trifecta (Willison, June 2025) |
| New claim | Trifecta = default config of deployed agents (CSO, June 15, 2026) |
| Evidence | 4 assistants leaked via injection, Jan 7–15 2026 (Breached.Company) |
| Architecture response | Meta Rule of Two (Oct 2025); Sophos blast-radius reduction (May 2026) |
| Recommended posture | Runtime behavioral detection across 5 signals |
The durable lesson: a control that everyone trips is not a control. As agents converge on the trifecta by design, defenders should stop treating it as a gate and start instrumenting what the agent does at runtime — because the next compromise will look exactly like normal work until the data is already gone.
Sources
- → https://www.csoonline.com/article/4184681/5-runtime-signals-for-catching-a-compromised-ai-agent.html
- → https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/
- → https://ai.meta.com/blog/practical-ai-agent-security/
- → https://www.sophos.com/en-us/blog/inside-the-lethal-trifecta-blast-radius-reduction-in-ai-agent-deployments
- → https://breached.company/the-lethal-trifecta-strikes-four-major-ai-agent-vulnerabilities-in-five-days/