Bleeding Llama: How CVE-2026-7482 Turns Ollama Into a Memory Leak Weapon
The Discovery
In May 2026, cybersecurity researchers from Cyera dropped a bombshell on the open-source AI community: CVE-2026-7482, a critical heap out-of-bounds read vulnerability in Ollama — the beloved framework that lets developers run LLMs locally without sending data to the cloud.
Codenamed "Bleeding Llama", this vulnerability carries a CVSS score of 9.1. It affects Ollama versions prior to 0.17.1 and, according to internet-wide scans, likely impacts over 300,000 publicly exposed servers globally. That is not a small number. That is a nation-state-scale attack surface.
What is Ollama?
For the uninitiated, Ollama is the go-to tool for running large language models on local machines. With 171,000+ GitHub stars and 16,100+ forks, it is the de facto standard for developers who want to self-host LLMs like Llama, Mistral, and Qwen without paying cloud API costs or leaking proprietary data to third parties.
The pitch is simple: download a model, run it locally, keep your data private.
But what if the very tool protecting your privacy becomes the vector for leaking everything?
The Vulnerability: A Technical Breakdown
The Root Cause
The flaw exists in Ollama's GGUF model loader. GGUF (GPT-Generated Unified Format) is the binary file format used to store quantized LLMs so they can be loaded and executed locally. When a user runs a command like:
```bash
ollama create mymodel -f Modelfile
```
... the server processes the model file through the /api/create endpoint. The vulnerability arises because Ollama's Go-based backend — specifically in fs/ggml/gguf.go and server/quantization.go — uses the unsafe package in the WriteTo() function when creating a model from a GGUF file.
Using unsafe in Go is like welding without a mask: it gives you raw memory access but strips away the language's memory safety guarantees.
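To make the failure mode concrete, here is a deliberately simplified Go sketch. It is not Ollama's actual gguf.go or quantization.go code; the tensorHeader type, its field layout, and the function names are invented for illustration. The point is the contrast: unsafe.Slice happily builds a view whose length the file declared, while a safe loader checks the declared extents against the real buffer first.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"unsafe"
)

// tensorHeader stands in for the per-tensor metadata a GGUF file declares.
// Both fields are attacker-controlled bytes read straight from the file.
type tensorHeader struct {
	Offset uint64
	Size   uint64
}

func parseHeader(raw []byte) tensorHeader {
	return tensorHeader{
		Offset: binary.LittleEndian.Uint64(raw[0:8]),
		Size:   binary.LittleEndian.Uint64(raw[8:16]),
	}
}

// vulnerableView trusts the declared size: unsafe.Slice builds a view whose
// length the attacker picked, so walking it reads heap memory past the buffer.
func vulnerableView(file []byte, hdr tensorHeader) []byte {
	return unsafe.Slice(&file[hdr.Offset], hdr.Size)
}

// safeView is the kind of bounds check a patched loader needs: validate the
// declared extents against the real buffer before building any view at all.
func safeView(file []byte, hdr tensorHeader) ([]byte, error) {
	end := hdr.Offset + hdr.Size
	if end < hdr.Offset || end > uint64(len(file)) { // catches overflow and OOB
		return nil, errors.New("tensor extends past end of file")
	}
	return file[hdr.Offset:end], nil
}

func main() {
	file := make([]byte, 32)                         // the entire "model file"
	binary.LittleEndian.PutUint64(file[0:8], 16)     // declared offset
	binary.LittleEndian.PutUint64(file[8:16], 1<<20) // declared size: a lie

	hdr := parseHeader(file)

	bad := vulnerableView(file, hdr)
	fmt.Println("attacker-chosen view length:", len(bad)) // 1048576, over a 32-byte buffer

	if _, err := safeView(file, hdr); err != nil {
		fmt.Println("safe loader rejects it:", err)
	}
}
```

Running the sketch prints a view length of 1,048,576 over a 32-byte buffer; in a real loader, copying or quantizing through such a view is what turns one malformed file into a heap read primitive.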
The Attack Chain
Here is how the exploitation unfolds in three stages:
- Crafted GGUF Upload: The attacker sends a malicious GGUF file to an exposed Ollama server. The file declares a tensor with an offset and size that exceed the file's actual length.
- Heap Out-of-Bounds Read: During quantization, the WriteTo() function reads past the allocated heap buffer. Because the tensor shape is set to an enormous value, the server keeps reading memory it does not own.
- Exfiltration via Model Push: The leaked memory — containing environment variables, API keys, system prompts, and conversation history — gets embedded into the quantized model artifact. The attacker then uploads this artifact to their own registry via the /api/push endpoint and harvests the stolen data at leisure.
This is not a theoretical vulnerability. It is a remote, unauthenticated memory leak that requires zero credentials and zero insider access.
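To see why the leak surfaces in the model artifact rather than in an error log, consider this contrived, self-contained Go simulation. A single slice stands in for the server's heap, with a fake API key sitting just past the "file" bytes; nothing here is exploit code, and the sizes and the secret are made up. Once an oversized view exists, ordinary quantization plumbing copies whatever it covers into the output, and /api/push ships that output wherever the attacker points it.

```go
package main

import (
	"bytes"
	"fmt"
)

func main() {
	// One allocation stands in for the server's heap: the first 32 bytes are
	// the "model file", and the bytes after it are neighbouring memory the
	// process happens to hold (here, a fake credential).
	heap := make([]byte, 0, 96)
	heap = append(heap, make([]byte, 32)...)                                 // the GGUF bytes
	heap = append(heap, []byte("OPENAI_API_KEY=sk-test-do-not-use-this")...) // a heap neighbour

	// The crafted header claims the tensor spans 64 bytes even though the
	// file region is only 32, so the view runs straight into the neighbour.
	declaredOffset, declaredSize := 0, 64
	view := heap[declaredOffset : declaredOffset+declaredSize]

	// Quantization copies the view into the output artifact verbatim; in the
	// real chain, that artifact is then exfiltrated via /api/push.
	var artifact bytes.Buffer
	artifact.Write(view)

	fmt.Println("secret embedded in artifact:",
		bytes.Contains(artifact.Bytes(), []byte("OPENAI_API_KEY")))
}
```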
What Gets Leaked?
The process memory of an Ollama server is a goldmine. Successful exploitation can extract:
- API keys for OpenAI, Anthropic, Google, and other LLM providers
- Environment variables containing database credentials, secrets, and tokens
- System prompts that reveal proprietary business logic
- Concurrent user conversation data — every chat session currently in memory
- Source code snippets passed through tools like Claude Code or Cursor
As Cyera researcher Dor Attias noted:
"An attacker can learn basically anything about the organization from your AI inference API keys, proprietary code, customer contracts, and much more."
And it gets worse. Many engineers connect Ollama to tools like Claude Code, which means all tool outputs flow to the Ollama server, get saved in the heap, and potentially end up in an attacker's hands.
The Windows Update Nightmare
If the "Bleeding Llama" vulnerability was not enough, researchers at Striga disclosed two additional flaws in Ollama's Windows update mechanism that can be chained into persistent code execution.
These flaws — a path traversal and a missing signature verification — remain unpatched as of May 2026, despite being disclosed on January 27, 2026. The 90-day responsible disclosure window has elapsed, and the public exploit details are now live.
How It Works on Windows
The Windows desktop client of Ollama:
- Auto-starts on login from the Windows Startup folder
- Listens on 127.0.0.1:11434
- Polls for updates via the /api/update endpoint
The attack chain is devastating:
- The attacker controls an update server (or overrides OLLAMA_UPDATE_URL via local HTTP)
- The malicious update payload is delivered without signature verification
- The payload gets written to the Windows Startup folder via path traversal
- On the next login, Windows executes the attacker's binary automatically
As Bartłomiej Dmitruk of Striga explains:
"The chain produces persistent, silent code execution at the privilege level of the user running Ollama. Realistic payloads include reverse shells, info-stealers exfiltrating browser secrets and SSH keys, or droppers that pivot to additional persistence mechanisms."
Versions 0.12.10 through 0.22.0 are vulnerable. The only interim mitigation is to disable automatic updates and remove the Ollama shortcut from the Startup folder.
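Neither Windows flaw is exotic to defend against. Here is a hedged Go sketch of the two checks a hardened updater would perform before touching disk or executing anything: path containment (blocking the Startup-folder traversal) and detached-signature verification against a pinned key (blocking the unsigned payload). This is illustrative code, not Ollama's updater; the ed25519 scheme, the directory paths, and the function names are assumptions.

```go
package main

import (
	"crypto/ed25519"
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// safeJoin refuses any update filename that would escape the updater's own
// directory, which is the check that blocks the Startup-folder traversal.
func safeJoin(baseDir, name string) (string, error) {
	dest := filepath.Join(baseDir, name) // Join also cleans ../ sequences
	rel, err := filepath.Rel(baseDir, dest)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", errors.New("update path escapes the update directory")
	}
	return dest, nil
}

// verifyUpdate refuses any payload whose detached signature does not verify
// against a key pinned inside the client, which is the missing-signature check.
func verifyUpdate(pub ed25519.PublicKey, payload, sig []byte) error {
	if !ed25519.Verify(pub, payload, sig) {
		return errors.New("update signature invalid: refusing to install")
	}
	return nil
}

func main() {
	// A traversal-style filename (forward slashes so the sketch runs anywhere;
	// a real payload would target the Windows Startup folder).
	base := "/var/lib/ollama/updates"
	evil := "../../../Users/dev/AppData/Roaming/Microsoft/Windows/Start Menu/Programs/Startup/evil.exe"
	if _, err := safeJoin(base, evil); err != nil {
		fmt.Println("traversal blocked:", err)
	}

	// An unsigned (zeroed) signature fails verification against the pinned key.
	pub, _, _ := ed25519.GenerateKey(nil)
	if err := verifyUpdate(pub, []byte("update payload"), make([]byte, ed25519.SignatureSize)); err != nil {
		fmt.Println("unsigned payload blocked:", err)
	}
}
```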
Immediate Mitigations
If you run Ollama in any capacity — personal, startup, or enterprise — here is what you do today:
- Upgrade to 0.17.1+ immediately for the "Bleeding Llama" fix
- Limit network exposure — bind Ollama to localhost only (OLLAMA_HOST=127.0.0.1)
- Audit your attack surface — scan for internet-facing Ollama instances with Shodan or Censys
- Deploy authentication — place an API gateway or reverse proxy (like Nginx with basic auth or OAuth) in front of Ollama's REST API
- Firewall your instances — isolate Ollama servers in a private network segment
- On Windows: Disable auto-updates and remove the Startup folder shortcut until a patch drops
- Monitor model uploads — restrict the /api/create and /api/push endpoints to authorized users only
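For the "deploy authentication" and "monitor model uploads" items, here is a minimal sketch of the pattern: a tiny Go reverse proxy that sits in front of an Ollama instance bound to loopback and refuses anything without HTTP basic auth. It is an illustration, not a production gateway; the listen port, username, and password are placeholders, and in practice you would terminate TLS and load credentials from a secret store (or use the Nginx/OAuth setup mentioned above instead).

```go
package main

import (
	"crypto/subtle"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Ollama itself stays bound to loopback (OLLAMA_HOST=127.0.0.1);
	// only this authenticated proxy is reachable from the network.
	upstream, err := url.Parse("http://127.0.0.1:11434")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	// Placeholder credentials for the sketch: load real ones from a secret
	// store, or swap basic auth for OAuth / mTLS at an actual gateway.
	const user, pass = "ollama-admin", "change-me"

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		u, p, ok := r.BasicAuth()
		if !ok ||
			subtle.ConstantTimeCompare([]byte(u), []byte(user)) != 1 ||
			subtle.ConstantTimeCompare([]byte(p), []byte(pass)) != 1 {
			w.Header().Set("WWW-Authenticate", `Basic realm="ollama"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		proxy.ServeHTTP(w, r)
	})

	log.Println("authenticated proxy listening on :8443 -> 127.0.0.1:11434")
	log.Fatal(http.ListenAndServe(":8443", handler))
}
```

The same handler is also the natural place to deny /api/create and /api/push outright for callers who have no business registering or pushing models.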
The Bigger Picture
This vulnerability is a wake-up call for the "local AI" movement. The entire value proposition of Ollama — privacy, control, and no cloud dependency — hinges on the assumption that running models locally is safer.
But local != secure.
The "Bleeding Llama" flaw proves that a locally hosted inference server can leak secrets more comprehensively than a compromised cloud API key, because the server has direct access to your environment, your files, and your entire development workflow.
As AI infrastructure becomes more distributed — with agents running on laptops, edge devices, and private clouds — the attack surface expands exponentially. The convenience of ollama run llama3.2 comes with the responsibility of hardening a full-stack inference server.
Final Thoughts
Open-source AI tools like Ollama are incredible force multipliers. They democratize access to models that were once locked behind corporate firewalls. But democratization without security is a recipe for catastrophe.
The "Bleeding Llama" vulnerability is not a bug in the model. It is a bug in the infrastructure layer — the code that loads, quantizes, and serves models. And that is exactly where attackers are looking.
If you are an AI engineer, a DevOps operator, or a CTO betting your stack on local LLMs, take this as a lesson: your inference server is now a Tier-0 asset. Protect it like you protect your database.
Stay paranoid. Keep patching. And never trust a model file you did not quantize yourself.
Sources: The Hacker News, Cyera Research, Striga Security, CERT Polska, CVE-2026-7482