How Much RAM Does a 7B Model Need?

Read in:

🇺🇸en 🇩🇪de 🇫🇷fr 🇯🇵ja 🇨🇳zh 🇪🇸es 🇧🇷pt 🇸🇦ar 🇰🇷ko

Quick Answer

A 7B model at Q4 quantization needs 5–6 GB of VRAM or RAM for efficient inference performance. Rule of thumb: model parameters in billions × 0.7 = approximate GB needed at Q4. GPU delivers ~25 tok/s; CPU delivers ~5 tok/s on same memory.

▸7B Q4: 5–6 GB VRAM or unified memory
▸7B Q5: 6–7 GB VRAM
▸7B Q8: 8–9 GB VRAM

Updated: 2026-05

Quantization & VRAMBeginner

Key Takeaways

✓A 7B model at Q4 needs 5–6 GB of VRAM — budget 6 GB to include context window overhead
✓Quick rule: parameter count in billions × 0.7 = approximate GB needed at Q4
✓Extending the context window to 16K tokens adds ~4 GB on top of the model weight

The Quick Rule for CPU and GPU

As of May 2026, a 7B model at Q4 needs 5–6 GB of memory — either system RAM (CPU-only inference) or VRAM (GPU inference). The amount is the same; what changes is speed. CPU inference runs at ~5 tokens per second on a modern 8-core processor. GPU inference runs at 20–25 tokens per second on a card with adequate VRAM.

On CPU-only, divide the GPU speed column by 5× for an 8-core processor estimate. A 7B model at Q4 runs at ~5 tok/s on CPU, ~25 on GPU. This 5× gap is why a budget GPU is worth buying for interactive use.

Model Size	Q4 Memory	GPU Speed
3B	~2 GB	~40 tok/s
7B	~5 GB	~25 tok/s
8B	~5.5 GB	~22 tok/s
13B	~9 GB	~15 tok/s

When to Choose CPU vs GPU

Choose CPU-only when you have 16+ GB of system RAM and your tasks are batch or background (overnight document analysis, scheduled summarization). The ~5 tok/s rate is acceptable for non-interactive work and avoids GPU costs entirely.

Choose GPU when you need interactive chat or coding. The 5× speed difference matters in real-time use. Even a budget RTX 3050 6 GB delivers ~22 tok/s on Llama 3 8B Q4_K_M — fast enough for chat that feels instant.

For the GPU-side full VRAM breakdown by tier, see how much VRAM a local LLM needs. For the complete hardware reference, see the complete VRAM guide for local LLMs.

Related Guides

▸Mistral Small 24B vs Qwen 14B vs Llama 8B Comparison -- Mistral Small 24B vs Qwen 14B vs Llama 8B comparison
▸Best SSD for Fast Model Loading -- best SSD for fast model loading
▸Can You Run RAG on 2 GB RAM? -- can you run RAG on 2 GB RAM?

Quick Answers About 7B Model RAM

Is 8 GB of system RAM enough to run a 7B model without a GPU?▾

Yes. Running CPU-only, a 7B model at Q4 uses ~5–6 GB of system RAM and runs at 3–6 tok/s on a modern 8-core processor. See the VRAM guide for GPU-accelerated options.

How much VRAM does Llama 3 8B need exactly?▾

~5.5 GB at Q4_K_M for the model weights. Add 0.5–1 GB for a 4096-token context window. Budget 6–7 GB total to avoid VRAM overflow.

What happens when a model exceeds available VRAM?▾

Ollama offloads layers to system RAM, which is 10–20× slower. The model still runs but generation speed drops significantly. To prevent this, lower the quantization or reduce context with --num-ctx 2048.

Is GPU inference always better than CPU?▾

Not for every use case. For batch tasks, scheduled processing, or non-interactive use, CPU at ~5 tok/s is acceptable and avoids GPU costs. For real-time chat or coding, GPU's 20–25 tok/s is essential.

Want the full breakdown?

Read the complete guide →

Related Prompt Bites

← Back to Prompt Bites