Quick Answer
A 7B model at Q4 quantization needs 5–6 GB of VRAM or RAM. Rule of thumb: model parameters in billions × 0.7 = approximate GB needed at Q4.
Updated: 2026-05
Key Takeaways
As of May 2026, a 7B model at Q4 needs 5–6 GB of memory — either system RAM (CPU-only inference) or VRAM (GPU inference). The amount is the same; what changes is speed. CPU inference runs at ~5 tokens per second on a modern 8-core processor. GPU inference runs at 20–25 tokens per second on a card with adequate VRAM.
On CPU-only, divide the GPU speed column by 5× for an 8-core processor estimate. A 7B model at Q4 runs at ~5 tok/s on CPU, ~25 on GPU. This 5× gap is why a budget GPU is worth buying for interactive use.
| Model Size | Q4 Memory | GPU Speed |
|---|---|---|
| 3B | ~2 GB | ~40 tok/s |
| 7B | ~5 GB | ~25 tok/s |
| 8B | ~5.5 GB | ~22 tok/s |
| 13B | ~9 GB | ~15 tok/s |
Choose CPU-only when you have 16+ GB of system RAM and your tasks are batch or background (overnight document analysis, scheduled summarization). The ~5 tok/s rate is acceptable for non-interactive work and avoids GPU costs entirely.
Choose GPU when you need interactive chat or coding. The 5× speed difference matters in real-time use. Even a budget RTX 3050 6 GB delivers ~22 tok/s on Llama 3 8B Q4_K_M — fast enough for chat that feels instant.
For the GPU-side full VRAM breakdown by tier, see how much VRAM a local LLM needs. For the complete hardware reference, see the complete VRAM guide for local LLMs.
--num-ctx 2048.