PromptQuorum

How Much RAM Does a 7B Model Need?

Quick Answer

A 7B model at Q4 quantization needs 5–6 GB of VRAM or RAM. Rule of thumb: model parameters in billions × 0.7 = approximate GB needed at Q4.

  • 7B Q4: 5–6 GB VRAM or unified memory
  • 7B Q5: 6–7 GB VRAM
  • 7B Q8: 8–9 GB VRAM
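
The rule of thumb above can be written as a one-line function. This is only a sketch of the ×0.7 approximation stated in the text; the function name is illustrative, and the result is an estimate, not a measurement.

```python
def estimate_q4_gb(params_billion: float) -> float:
    """Approximate memory (GB) for a model at Q4: billions of parameters x 0.7."""
    return params_billion * 0.7


# Print the estimate for a few common model sizes.
for size in (3, 7, 8, 13):
    print(f"{size}B at Q4: ~{estimate_q4_gb(size):.1f} GB")
```

The estimate covers the model at Q4 only; Q5 and Q8 need more, as listed above.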

Updated: 2026-05

Quantization & VRAM · Beginner

Key Takeaways

  • A 7B model at Q4 needs 5–6 GB of VRAM — budget 6 GB to include context window overhead
  • Quick rule: parameter count in billions × 0.7 = approximate GB needed at Q4
  • Extending the context window to 16K tokens adds ~4 GB on top of the model weights
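
To see how context length changes the total budget, the sketch below scales the ~4 GB figure for 16K tokens linearly with context length. Linear scaling and the function names are assumptions made for illustration; the real overhead depends on the model architecture and the runtime.

```python
GB_PER_16K_TOKENS = 4.0  # figure from the takeaways above (~4 GB at 16K tokens)


def context_overhead_gb(context_tokens: int) -> float:
    """Estimated extra memory for the context window, assuming linear scaling."""
    return GB_PER_16K_TOKENS * context_tokens / 16384


def total_budget_gb(weights_gb: float, context_tokens: int) -> float:
    """Model weights plus estimated context overhead, in GB."""
    return weights_gb + context_overhead_gb(context_tokens)


# A 7B model at Q4 (~5 GB of weights) at three context sizes.
for ctx in (2048, 4096, 16384):
    print(f"{ctx} tokens: ~{total_budget_gb(5.0, ctx):.1f} GB total")
```

Under this assumption a 4096-token window costs about 1 GB, so a 7B model at Q4 lands near the 6 GB budget.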

The Quick Rule for CPU and GPU

As of May 2026, a 7B model at Q4 needs 5–6 GB of memory — either system RAM (CPU-only inference) or VRAM (GPU inference). The amount is the same; what changes is speed. CPU inference runs at ~5 tokens per second on a modern 8-core processor. GPU inference runs at 20–25 tokens per second on a card with adequate VRAM.

For a CPU-only estimate on an 8-core processor, divide the GPU speeds in the table below by 5. A 7B model at Q4 runs at ~5 tok/s on CPU versus ~25 tok/s on GPU. This 5× gap is why a budget GPU is worth buying for interactive use.

Model Size   Q4 Memory   GPU Speed
3B           ~2 GB       ~40 tok/s
7B           ~5 GB       ~25 tok/s
8B           ~5.5 GB     ~22 tok/s
13B          ~9 GB       ~15 tok/s
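
To make the speed gap concrete, the sketch below converts the GPU speeds in the table into CPU estimates using the 5× factor, then into waiting time for a reply. The 500-token reply length is a hypothetical example, not a figure from this article.

```python
# GPU speeds (tokens per second) taken from the table above.
GPU_TOK_PER_S = {"3B": 40, "7B": 25, "8B": 22, "13B": 15}
CPU_SLOWDOWN = 5  # CPU-only estimate: GPU speed divided by 5


def cpu_tok_per_s(model: str) -> float:
    """Estimated CPU-only speed for a model size listed in the table."""
    return GPU_TOK_PER_S[model] / CPU_SLOWDOWN


def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Time in seconds to generate a given number of tokens."""
    return tokens / tok_per_s


REPLY_TOKENS = 500  # hypothetical reply length

for model, gpu_speed in GPU_TOK_PER_S.items():
    gpu_s = seconds_for(REPLY_TOKENS, gpu_speed)
    cpu_s = seconds_for(REPLY_TOKENS, cpu_tok_per_s(model))
    print(f"{model}: GPU ~{gpu_s:.0f} s, CPU ~{cpu_s:.0f} s")
```

For the 7B row this works out to roughly 20 seconds on GPU versus 100 seconds on CPU for the same reply.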

When to Choose CPU vs GPU

Choose CPU-only when you have 16+ GB of system RAM and your tasks are batch or background (overnight document analysis, scheduled summarization). The ~5 tok/s rate is acceptable for non-interactive work and avoids GPU costs entirely.

Choose GPU when you need interactive chat or coding. The 5× speed difference matters in real-time use. Even a budget RTX 3050 6 GB delivers ~22 tok/s on Llama 3 8B Q4_K_M — fast enough for chat that feels instant.

For the GPU-side full VRAM breakdown by tier, see how much VRAM a local LLM needs. For the complete hardware reference, see the complete VRAM guide for local LLMs.

Quick Answers About 7B Model RAM

Is 8 GB of system RAM enough to run a 7B model without a GPU?
Yes, though it is tight. Running CPU-only, a 7B model at Q4 uses ~5–6 GB of system RAM, which leaves only 2–3 GB for the operating system and other applications, and runs at 3–6 tok/s on a modern 8-core processor. See the VRAM guide for GPU-accelerated options.
How much VRAM does Llama 3 8B need exactly?
~5.5 GB at Q4_K_M for the model weights. Add 0.5–1 GB for a 4096-token context window. Budget 6–7 GB total to avoid VRAM overflow.
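
The budget above can be turned into a quick fit check. This is a sketch using only the figures in this answer (5.5 GB of weights, 0.5–1 GB of context); the function name and the three verdict labels are illustrative, and it ignores anything else that may be using VRAM.

```python
def vram_verdict(vram_gb: float,
                 weights_gb: float = 5.5,
                 context_gb: tuple = (0.5, 1.0)) -> str:
    """Classify a card against the low and high ends of the memory budget."""
    low = weights_gb + context_gb[0]   # weights + smallest context estimate
    high = weights_gb + context_gb[1]  # weights + largest context estimate
    if vram_gb >= high:
        return "fits"
    if vram_gb >= low:
        return "borderline"
    return "overflows"


for card_gb in (4, 6, 8):
    print(f"{card_gb} GB card: {vram_verdict(card_gb)}")
```
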
What happens when a model exceeds available VRAM?
Ollama offloads the layers that do not fit to system RAM, where they run 10–20× slower. The model still runs, but generation speed drops significantly. To prevent this, switch to a lower-bit quantization or shrink the context window, for example by setting Ollama's num_ctx parameter to 2048.
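
One way to cap the context window is the num_ctx option of Ollama's HTTP API. The sketch below assumes a local Ollama server on its default port (11434) and a pulled model tagged llama3:8b; the model tag and helper names are assumptions, not details from this article.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_request(prompt: str, model: str = "llama3:8b",
                  num_ctx: int = 2048) -> urllib.request.Request:
    """Build a generate request that limits the context window via num_ctx."""
    payload = {
        "model": model,                    # assumed model tag
        "prompt": prompt,
        "stream": False,                   # ask for one JSON reply, not a stream
        "options": {"num_ctx": num_ctx},   # context window size in tokens
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, generate("Summarize this paragraph.") would return the model's reply using a 2048-token context window.
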
Is GPU inference always better than CPU?
Not for every use case. For batch tasks, scheduled processing, or non-interactive use, CPU at ~5 tok/s is acceptable and avoids GPU costs. For real-time chat or coding, GPU's 20–25 tok/s is essential.