SUPPLY CHAIN MEDIUM NEW

GGUF model files are untrusted input: llama.cpp's recurring parser RCEs

CVE-2026-33298 (March 2026) and a May 15, 2026 oss-sec disclosure show llama.cpp's GGUF parser keeps hitting integer-overflow heap corruption: loading a crafted model file can mean RCE.

2026-06-05 // 6 min // affects: llama.cpp, ggml, gguf, ollama, lm-studio

In brief: GGUF, the file format that powers local LLMs, is parsed by memory-unsafe C/C++ in llama.cpp and its ggml core. CVE-2026-33298, published March 18, 2026 (CVSS 7.8), is an integer overflow in tensor-size math that lets a crafted model file corrupt the heap and potentially execute code. It is not the first: CVE-2025-53630 (July 2025) was the same class, and a full-disclosure advisory dated May 15, 2026 lists six more GGUF-parser weaknesses. The lesson is structural: a model file is untrusted input, and the loader is an attack surface.

What is this?

GGUF is the de-facto container format for local large language models. When you pull a quantized model from Hugging Face and run it under llama.cpp, Ollama, LM Studio, or any tool built on the ggml library, that .gguf file is parsed by C/C++ code on your machine before any inference happens.

CVE-2026-33298, disclosed in GitHub Security Advisory GHSA-96jg-mvhq-q7q7 on March 18, 2026 (CWE-122 / CWE-190, CVSS 7.8, AV:L/AC:L/PR:N/UI:R), is an integer overflow in the ggml_nbytes function — the routine that computes how many bytes a tensor needs. A model file with crafted tensor dimensions makes that calculation wrap around, and the parser allocates a buffer far smaller than the tensor actually requires. What looks like a math bug becomes memory corruption.

The recurrence is the real story. The same overflow class hit gguf_init_from_file_impl a year earlier in CVE-2025-53630 (published July 10, 2025), and on May 15, 2026 an independent researcher published a full-disclosure advisory on oss-sec describing six further weaknesses across gguf.cpp and the Python gguf_reader.py reference — from a missing upper bound on the alignment value to unchecked enum and string-length fields.

How it works

The pattern is consistent across all of these issues. A size or offset is computed from values read directly out of the attacker-controlled file. On a 64-bit host, multiplying large tensor dimensions overflows size_t and wraps to a small number; the loader then trusts that small number to size a heap allocation. When subsequent code copies or indexes the tensor at its true (much larger) extent — or follows a tensor offset that points past the undersized buffer — it reads or writes outside the allocation. That heap out-of-bounds write is the primitive an attacker builds on to corrupt adjacent structures and, in the worst case, steer execution.
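The wraparound is easy to reproduce in isolation. The following is a minimal sketch, not the actual ggml code: `naive_nbytes` is a hypothetical helper that multiplies two dimensions and an element size with no overflow check, using `uint64_t` to stand in for `size_t` on a 64-bit host.

```c
#include <stdint.h>

/* Hypothetical helper (not ggml's ggml_nbytes): naive byte count for a
 * 2-D tensor. Unsigned multiplication silently wraps modulo 2^64. */
uint64_t naive_nbytes(uint64_t ne0, uint64_t ne1, uint64_t elem_size) {
    return ne0 * ne1 * elem_size;
}

/* Illustrative attacker-chosen dimensions: ne0 = 2^62 + 4, ne1 = 1,
 * with 4-byte elements. The true size is 2^64 + 16 bytes, which does
 * not fit in 64 bits, so the product wraps to 16. */
uint64_t crafted_tensor_size(void) {
    return naive_nbytes((1ULL << 62) + 4, 1, 4);
}
```

A loader that sized its allocation from this result would reserve 16 bytes for a tensor whose declared extent is astronomically larger, which is the mismatch the rest of the exploit chain depends on.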

The CVSS vector matters for triage: the attack vector is local with user interaction required (UI:R). Nobody reaches into your process remotely — the trigger is you loading the file. But in the local-LLM world, downloading a stranger’s GGUF quantization and loading it is an everyday action, often automated by a one-line pull. The file is treated as inert data; the parser treats its fields as gospel.

Why it matters

ggml and llama.cpp sit underneath a large share of the local-AI ecosystem — Ollama, LM Studio, GPT4All and many smaller tools embed the same parsing code. A weakness in GGUF parsing is therefore not one product’s bug but a shared dependency’s bug, inherited by everything downstream. And because community model hubs make it trivial to publish and share GGUF files, the distribution channel for a malicious model is already built and widely trusted.

This mirrors the long-running lesson from Python pickle-based checkpoints: model artifacts are code-adjacent, not passive data. Safe-by-format efforts (like avoiding pickle) address the deserialization-of-code problem, but they do not make the binary parser itself memory-safe. CVE-2026-33298 and its siblings show that even a “data-only” format becomes dangerous when its parser miscounts bytes.

Defenses

  1. Update. CVE-2026-33298 is fixed in llama.cpp build b7824. If you ship or embed ggml/llama.cpp, rebase to a current release and track its security advisories — this class has recurred, so pin and watch rather than fix-and-forget.
  2. Treat GGUF as untrusted. Load models only from publishers you trust; verify checksums and provenance before loading. A community quantization from an unknown uploader deserves the same suspicion as an unsigned binary.
  3. Sandbox the loader. Run model loading and inference in a minimal container with no access to secrets or sensitive data stores and with restricted egress. Apply memory and CPU limits so a memory-exhaustion bug ends in a contained crash, and rely on the container boundary rather than the limits to keep a successful heap corruption from becoming a pivot.
  4. Validate before loading where you can. Prefer maintained loaders that bounds-check tensor sizes, offsets, alignment and string lengths against the actual file size. If you build tooling on ggml, fuzz the GGUF path and use checked arithmetic (e.g. __builtin_mul_overflow) on every size calculation.
  5. Monitor. Segfaults, sanitizer aborts or unexpected memory spikes during model load are a useful hunting signal that a file tried to break the parser rather than be served by it.
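The checked arithmetic and bounds checks from item 4 can be sketched as follows. This is an illustration under stated assumptions, not llama.cpp's actual fix: `checked_nbytes` and `tensor_in_file` are hypothetical helpers, the size formula ignores the block layout of quantized types, and `__builtin_mul_overflow` is a GCC/Clang builtin that may need a substitute on other compilers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: multiply every dimension by the element size,
 * returning false as soon as any intermediate product overflows. */
bool checked_nbytes(const uint64_t *ne, size_t n_dims,
                    uint64_t elem_size, uint64_t *out) {
    uint64_t total = elem_size;
    for (size_t i = 0; i < n_dims; i++) {
        /* GCC/Clang builtin: returns true if the product overflowed. */
        if (__builtin_mul_overflow(total, ne[i], &total)) {
            return false;
        }
    }
    *out = total;
    return true;
}

/* Hypothetical helper: true only if [offset, offset + nbytes) lies
 * entirely inside a file of file_size bytes. Written as a subtraction
 * so the check itself cannot overflow. */
bool tensor_in_file(uint64_t offset, uint64_t nbytes, uint64_t file_size) {
    return offset <= file_size && nbytes <= file_size - offset;
}
```

A loader built this way would reject the file when either check fails, before allocating anything. Comparing against the real file size matters as much as the overflow check: a size that does not wrap can still describe data the file does not contain.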

Status

Item | Reference | Date | Notes
CVE-2026-33298 | GHSA-96jg-mvhq-q7q7 | 2026-03-18 | Integer overflow in ggml_nbytes, heap overflow, CVSS 7.8, fixed in b7824
oss-sec GGUF advisory | seclists.org (V-01…V-06) | 2026-05-15 | Six further gguf.cpp / gguf_reader.py parser weaknesses, full disclosure
CVE-2025-53630 | GHSA-vgg9-87g3-85w8 | 2025-07-10 | Same overflow class in gguf_init_from_file_impl, heap OOB read/write

The right framing is not “patch one CVE.” It is that GGUF parsing in ggml/llama.cpp has repeatedly produced the same integer-overflow-to-heap-corruption class — so any stack that loads model files from outside its trust boundary should add provenance checks and isolation that do not depend on the parser getting every size calculation right.

Sources