ADVERSARIAL

10 hacks.

ADVERSARIAL MEDIUM NEW

PRAC: hijacking a computer-use agent's choice through its attention

An April 2026 Tübingen paper shows one imperceptibly perturbed product image can concentrate a computer-use agent's visual attention and steer 82% of its selections — without ever touching the output.

2026-06-22//6 min
ADVERSARIAL MEDIUM NEW

When the AI reviewer can't read the figure: cross-modal attacks on peer review

A June 2026 arXiv paper (PaperGuard) shows AI peer reviewers are vulnerable not only through text but also through figures: black-box prompt injection and white-box image perturbations both flip verdicts.

2026-06-20//6 min
ADVERSARIAL MEDIUM NEW

Rapid Poison: turning a jailbreak defense into an attack surface

A June 15, 2026 arXiv paper shows the proliferation step inside Rapid Response jailbreak defenses can be poisoned at a 1% poisoning rate, forcing up to 100% false positives or 96% false negatives in the guard classifier.

2026-06-19//7 min
ADVERSARIAL MEDIUM NEW

Black-Hole Attack: poisoning a vector database through embedding geometry

An April 7, 2026 paper shows a few vectors placed near the embedding centroid get pulled into up to 99.85% of top-10 results — a query-agnostic, model-agnostic poisoning of vector databases.

2026-06-18//6 min
ADVERSARIAL MEDIUM NEW

M3Att: query-agnostic knowledge poisoning of medical multimodal RAG

A May 2026 paper poisons medical image-text RAG without knowing user queries in advance. Imperceptible image perturbations hijack retrieval; ambiguity-guided text evades the model's self-correction — and pre-filter defenses barely dent it.

2026-06-17//6 min
ADVERSARIAL MEDIUM NEW

CRCP: RAG corpus poisoning that survives chunking and reranking

A June 9, 2026 arXiv paper shows many corpus-poisoning attacks quietly fail after reranking — and proposes CRCP, a chunk-aware variant built to survive realistic multi-stage RAG pipelines. The lesson is about how you evaluate, not just how you defend.

2026-06-15//6 min
ADVERSARIAL MEDIUM NEW

HPAA: typography humans read but moderation LLMs miss

A June 8, 2026 paper introduces Human-Perceptible Adversarial Attacks — harmful text that stays obvious to a human reader but slips past LLM content moderation through typographic manipulation.

2026-06-11//5 min
ADVERSARIAL MEDIUM NEW

SlotGCG: adversarial token position, not just content, drives jailbreaks

A June 2026 paper shows GCG-style jailbreaks get ~14% stronger when adversarial tokens are placed at attention-correlated slots inside the prompt, and retain 42% more attack success under input filtering.

2026-06-08//6 min
ADVERSARIAL MEDIUM NEW

SilentRetrieval: fluent RAG corpus poisoning that slips past perplexity filters

A May 27, 2026 arXiv preprint introduces a two-stage attack that hides goal-hijacking triggers inside fluent documents, reaching 57% LLM-attack success on Natural Questions and MS MARCO with one poisoned record per query.

2026-05-29//6 min
ADVERSARIAL MEDIUM

Usability as a Weapon: how feature requests make coding LLMs insecure

A May 11, 2026 arXiv paper shows that asking a coding LLM for a faster, simpler or feature-richer version of secure code reliably makes it drop the security constraints. UPAttack reaches 98.1% success on GPT-5.2-chat and Gemini-3.

2026-05-26//7 min