How Much VRAM Do You Need for a Local LLM?

Quick Answer

4 GB VRAM handles Phi-4 Mini and Gemma 2 2B comfortably, with headroom for a larger context. 6 GB runs Llama 3 8B at Q4. 12 GB fits Qwen 14B at Q4. 70B models at Q4 need 40+ GB; a 16 GB card can run them only partially, with the remaining layers offloaded to system RAM.

▸ 4 GB: Phi-4 Mini Q4, Gemma 2 2B
▸ 6 GB: Llama 3 8B Q4_K_M
▸ 8–12 GB: Mistral Small Q5, Qwen 14B Q4

Updated: 2026-05

Quantization & VRAM · Beginner

Key Takeaways

✓ 4 GB VRAM runs Phi-4 Mini Q4 and Gemma 2 2B comfortably
✓ 6 GB is the entry point for Llama 3 8B at Q4_K_M, the most popular local model
✓ 12 GB unlocks Qwen 14B Q4, the best quality-per-dollar tier
✓ 70B models require 40+ GB: plan for dual RTX 3090 cards or an Apple M-series machine with large unified memory

VRAM Requirements by Model Size

As of May 2026, a model's VRAM need follows a simple rule of thumb: parameter count in billions × 0.7 ≈ GB of weights at Q4 quantization. A 7B model needs ~4.9 GB for weights, plus 0.5–1 GB of context overhead, for a total of roughly 5.4–5.9 GB. This is why 6 GB is the minimum for the 7–8B tier, and why 12 GB unlocks the 14B tier with breathing room.
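The rule of thumb above can be sketched in a few lines of Python. This is only an illustration of the article's arithmetic; the function name and the default overhead range are assumptions made for the example, not part of any tool.

```python
def estimate_q4_vram_gb(params_billions, context_overhead_gb=(0.5, 1.0)):
    """Rough VRAM range in GB at Q4, using the 0.7 GB-per-billion rule of thumb."""
    weights_gb = params_billions * 0.7  # approximate size of the Q4 weights
    low, high = context_overhead_gb     # assumed overhead at default context
    return round(weights_gb + low, 1), round(weights_gb + high, 1)


for size in (7, 14):
    low, high = estimate_q4_vram_gb(size)
    print(f"{size}B at Q4: about {low}-{high} GB")
```

Treat the result as a starting estimate: real usage varies with the exact quantization variant and the context length you configure.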

Use the table below as a quick decision reference. The "Speed" column assumes Ollama on a desktop GPU running at default context (2048 tokens).

Always keep 1–2 GB of VRAM free above your model's stated needs. The operating system, browser tabs, and Ollama's runtime consume 500 MB–1 GB even with no model loaded. A 6 GB card running a 5.5 GB model leaves only 500 MB of headroom, so you'll hit out-of-memory errors the moment you raise the num_ctx setting beyond 2048 tokens. For the 6 GB tier with safe headroom, see best local LLMs for 6 GB VRAM.
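As a quick sanity check, the headroom rule can be expressed as a small helper. The function names and the 1 GB default threshold are assumptions chosen to match the guidance above.

```python
def free_headroom_gb(card_vram_gb, model_vram_gb):
    """VRAM left over once the model is loaded, in GB."""
    return card_vram_gb - model_vram_gb


def has_safe_headroom(card_vram_gb, model_vram_gb, min_headroom_gb=1.0):
    """True if at least min_headroom_gb stays free after loading the model."""
    return free_headroom_gb(card_vram_gb, model_vram_gb) >= min_headroom_gb


# The example from the text: a 5.5 GB model on a 6 GB card.
print(free_headroom_gb(6, 5.5))   # 0.5
print(has_safe_headroom(6, 5.5))  # False
```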

VRAM	Best Model (quantization)	Speed (approx.)
4 GB	Phi-4 Mini Q4	~25 tok/s
6 GB	Llama 3 8B Q4_K_M	~20 tok/s
8 GB	Mistral Small Q5_K_M	~18 tok/s
12 GB	Qwen 14B Q4_K_M	~15 tok/s
16+ GB	Qwen 32B Q4 or Llama 70B partial	~8 tok/s
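If you want to script the table lookup, one possible sketch is below. The tier data is copied from the table; the names TIERS and best_model_for are hypothetical, and a card below 4 GB simply returns None.

```python
from bisect import bisect_right

# (minimum VRAM in GB, recommended model) taken from the table above.
TIERS = [
    (4, "Phi-4 Mini Q4"),
    (6, "Llama 3 8B Q4_K_M"),
    (8, "Mistral Small Q5_K_M"),
    (12, "Qwen 14B Q4_K_M"),
    (16, "Qwen 32B Q4 or Llama 70B partial"),
]


def best_model_for(vram_gb):
    """Return the recommendation for the highest tier that fits, or None."""
    thresholds = [minimum for minimum, _ in TIERS]
    index = bisect_right(thresholds, vram_gb) - 1  # last tier <= vram_gb
    if index < 0:
        return None
    return TIERS[index][1]


print(best_model_for(10))  # falls in the 8 GB tier
```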

When Your VRAM Is Not Enough

If a model exceeds your VRAM, you have three options: lower the quantization (for example, Q4_K_M instead of Q5_K_M), reduce the context window by setting num_ctx to 2048 (for example, /set parameter num_ctx 2048 inside an ollama run session), or let Ollama offload layers to system RAM.
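When calling Ollama from code rather than the terminal, the context cap can be passed as the num_ctx option in a request to its local HTTP API. The sketch below assumes Ollama is listening on its default port (11434) and that a model tagged llama3:8b has already been pulled; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default endpoint


def build_generate_request(model, prompt, num_ctx=2048):
    """Build a POST request that caps the context window via options.num_ctx."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                 # ask for a single JSON response
        "options": {"num_ctx": num_ctx}  # smaller context means less VRAM
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, num_ctx=2048):
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_generate_request(model, prompt, num_ctx)) as resp:
        return json.loads(resp.read())["response"]


# Example (requires a running Ollama server):
# print(generate("llama3:8b", "Say hello in five words."))
```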

CPU offload works but is slow, because each layer moved to RAM adds latency. For interactive use, stay within your GPU's VRAM limit. At roughly 250 MB per 1,000 tokens, reducing context from 4096 to 2048 tokens saves approximately 0.5 GB on a 7B model.

For a full breakdown of model sizes and the math behind VRAM estimates, see the complete VRAM guide for local LLMs. For the 7B tier specifically, see how much RAM a 7B model needs.

Quick Answers About VRAM

Is 8 GB VRAM enough for local LLMs?

Yes. 8 GB runs Llama 3 8B at Q5_K_M at around 18 tokens per second, or Mistral Small at Q5_K_M with headroom to spare. Most day-to-day chat and coding tasks are well covered at this tier.

Can I run a 7B model on 4 GB VRAM?

No. A 7B model at Q4 needs 5–6 GB of VRAM. The smallest usable quantization still exceeds 4 GB. See how much RAM a 7B model needs for the full breakdown.

Does context window size affect VRAM usage?

Yes. Each additional 1,000 context tokens uses approximately 250 MB of VRAM on a 7B model. The default 2048-token context uses ~0.5 GB; 16,384 tokens uses ~4 GB on top of the model weight.
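The per-token figure above makes it easy to estimate context cost for any window size. A minimal sketch, assuming the article's ~250 MB per 1,000 tokens for a 7B model and treating 1 GB as 1,000 MB; the function names are illustrative only.

```python
MB_PER_1000_TOKENS = 250  # rule-of-thumb figure for a 7B model


def context_vram_gb(context_tokens):
    """Approximate VRAM used by the context window, in GB."""
    return context_tokens / 1000 * MB_PER_1000_TOKENS / 1000


def context_savings_gb(old_tokens, new_tokens):
    """Approximate VRAM freed by shrinking the context window, in GB."""
    return context_vram_gb(old_tokens) - context_vram_gb(new_tokens)


for tokens in (2048, 4096, 16384):
    print(f"{tokens} tokens: ~{context_vram_gb(tokens):.1f} GB")
```

Models with different attention layouts can use noticeably more or less memory per token, so measure actual usage before relying on the estimate.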

What should I do if my model uses more VRAM than expected?

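Set num_ctx to 2048 in Ollama (for example, /set parameter num_ctx 2048 inside an ollama run session). This reduces VRAM by up to 2 GB on 7B models if you were running a much larger context, without changing the model file.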
