Quick Answer
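4 GB handles Phi-4 Mini and Gemma 2B. 6 GB runs Llama 3 8B at Q4. 8 GB handles Mistral 7B at Q5. 12 GB fits Qwen 14B at Q4. 16 GB and up reaches Qwen 32B at Q4 or a 70B model with part of it offloaded to system RAM; a 70B model at Q4 needs roughly 49 GB for weights alone, so it does not fit entirely in 16 GB.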
Updated: 2026-05
Key Takeaways
As of May 2026, a model's VRAM need follows a simple formula: parameter count in billions × 0.7 = approximate GB at Q4 quantization. A 7B model needs ~4.9 GB for weights, plus 0.5–1 GB of context overhead. This is why 6 GB is the minimum for the 7–8B tier, and why 12 GB unlocks the 14B tier with breathing room.
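The rule of thumb above is easy to check with a few lines of Python. This is a minimal sketch of the article's estimate only: the function name is hypothetical, and the default 1 GB context overhead is an assumption taken from the upper end of the 0.5–1 GB range quoted above.

```python
def estimate_q4_vram_gb(params_billions, context_overhead_gb=1.0):
    """Rough Q4 estimate: weights (params in billions x 0.7) plus context overhead.

    Returns (weights_gb, total_gb). The 0.7 factor and the overhead range
    come from the rule of thumb in the text, not from a measurement.
    """
    weights_gb = params_billions * 0.7
    return weights_gb, weights_gb + context_overhead_gb


for size in (7, 8, 14, 70):
    weights, total = estimate_q4_vram_gb(size)
    print(f"{size}B: ~{weights:.1f} GB weights, ~{total:.1f} GB with context")
```

Treat the output as a starting point; real usage varies with the quantization variant and the runtime.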
Use the table below as a quick decision reference. The "Speed" column assumes Ollama on a desktop GPU running at default context (2048 tokens).
Always keep 1–2 GB of VRAM free above your model's stated needs. The operating system, browser tabs, and Ollama's runtime consume 500 MB–1 GB even with no model loaded. A 6 GB card running a 5.5 GB model leaves only 500 MB of headroom, so you'll hit out-of-memory errors the moment you raise the context window (Ollama's num_ctx parameter) beyond 2048 tokens. For the 6 GB tier with safe headroom, see best local LLMs for 6 GB VRAM.
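The headroom rule can be expressed as a quick check. The sketch below is illustrative: the function name is hypothetical, and the default 1 GB headroom is the lower bound of the 1–2 GB recommendation above.

```python
def fits_with_headroom(card_vram_gb, model_vram_gb, headroom_gb=1.0):
    """Return (fits, free_gb).

    fits is True only if loading the model still leaves at least
    headroom_gb of VRAM free on the card.
    """
    free_gb = card_vram_gb - model_vram_gb
    return free_gb >= headroom_gb, free_gb


# The example from the text: a 6 GB card running a 5.5 GB model.
fits, free = fits_with_headroom(6.0, 5.5)
print(f"6 GB card, 5.5 GB model: {free:.1f} GB free, safe={fits}")
```

Raising `headroom_gb` to 2.0 gives the more conservative end of the recommendation.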
| VRAM | Best Model (Quantization) | Approx. Speed |
|---|---|---|
| 4 GB | Phi-4 Mini Q4 | ~25 tok/s |
| 6 GB | Llama 3 8B Q4_K_M | ~20 tok/s |
| 8 GB | Mistral 7B Q5_K_M | ~18 tok/s |
| 12 GB | Qwen 14B Q4_K_M | ~15 tok/s |
| 16+ GB | Qwen 32B Q4 or Llama 70B partial | ~8 tok/s |
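For scripting, the table can be turned into a small lookup. This is a sketch only: the rows are copied from the table above, while the `TIERS` and `recommend` names are hypothetical.

```python
# Rows copied from the table above: (minimum VRAM in GB, model, approx. tok/s).
TIERS = [
    (4, "Phi-4 Mini Q4", 25),
    (6, "Llama 3 8B Q4_K_M", 20),
    (8, "Mistral 7B Q5_K_M", 18),
    (12, "Qwen 14B Q4_K_M", 15),
    (16, "Qwen 32B Q4 or Llama 70B partial", 8),
]


def recommend(vram_gb):
    """Return (model, tok_s) for the highest tier the card meets, or None."""
    best = None
    for min_gb, model, tok_s in TIERS:
        if vram_gb >= min_gb:
            best = (model, tok_s)
    return best


print(recommend(10))
```

A card that falls between tiers, such as 10 GB, maps to the tier just below it; anything under 4 GB returns `None`.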
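If a model exceeds your VRAM, you have three options: lower the quantization (Q4_K_M instead of Q5_K_M), reduce the context window by setting num_ctx to 2048 (for example, /set parameter num_ctx 2048 inside an ollama run session, or PARAMETER num_ctx 2048 in a Modelfile), or let Ollama offload layers to system RAM.
CPU offload works but is slow: every layer that runs from system RAM instead of VRAM adds latency to each generated token. For interactive use, stay within your GPU's VRAM limit. Reducing context from 4096 to 2048 tokens can save up to about 2 GB on a 7B model, depending on the model's architecture.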
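The context-window reduction can also be applied per request through Ollama's REST API, which accepts `num_ctx` inside the `options` object. The sketch below assumes a local Ollama server at its default address; the helper names `build_request` and `generate` are hypothetical, and the model tag in the commented example is only a placeholder for whatever model you have pulled.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumed here).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt, num_ctx=2048):
    """Build a non-streaming generate request with a reduced context window."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # context length in tokens
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def generate(model, prompt, num_ctx=2048):
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(model, prompt, num_ctx)) as resp:
        return json.loads(resp.read())["response"]


# Example (needs a running Ollama server with the model already pulled):
# print(generate("llama3", "Say hello in five words."))
```

Setting the option per request leaves the model file untouched, so the same model can be used with a larger context later.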
For a full breakdown of model sizes and the math behind VRAM estimates, see the complete VRAM guide for local LLMs. For the 7B tier specifically, see how much RAM a 7B model needs.
Set num_ctx to 2048 in Ollama. This reduces VRAM use by up to 2 GB on 7B models without changing the model file.