DeepSeek-R1 Distill VRAM Cheatsheet (2026)
Quick Answer
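At Q4_K_M (the Ollama default): 1.5B ≈ 4 GB, 7B ≈ 5.5 GB, 8B ≈ 6 GB, 14B ≈ 9.5 GB, 32B ≈ 20.5 GB, 70B ≈ 42 GB of VRAM. As a rough upper bound, Q8_0 is about 2× the Q4_K_M size and FP16 about 4×; the per-model figures in the table below come in somewhat under both multipliers, and the 32B at FP16 still needs a 64 GB-class setup.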
- 1.5B: ~1.1 GB file, ~4 GB VRAM (or CPU) at Q4_K_M
- 7B: ~4.7 GB file, ~5.5 GB VRAM (RTX 3060 12GB)
- 8B: ~6 GB VRAM (RTX 3060 12GB)
- 14B: ~9 GB file, ~9.5 GB VRAM (RTX 4060 Ti 16GB)
- 32B: ~19 GB file, ~20.5 GB VRAM (RTX 4090 24GB, tight)
- 70B: ~40 GB file, ~42 GB VRAM (dual-GPU or a 48 GB card)
- Rule: Q8_0 ≈ 2× Q4_K_M; FP16 ≈ 4× Q4_K_M
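The rule of thumb above is simple enough to sketch in a few lines. The snippet below is a minimal illustration, not a measurement tool: the file sizes are the Q4_K_M figures quoted in this list, the multipliers are the 2× and 4× rule as stated, and the names `Q4_FILE_GB`, `MULTIPLIER`, and `estimate_file_gb` are made up for this example.

```python
# Q4_K_M file sizes (GB) as quoted in the Quick Answer list.
Q4_FILE_GB = {"1.5B": 1.1, "7B": 4.7, "14B": 9.0, "32B": 19.0, "70B": 40.0}

# Rule-of-thumb multipliers relative to Q4_K_M, as stated in the cheatsheet.
MULTIPLIER = {"Q4_K_M": 1.0, "Q8_0": 2.0, "FP16": 4.0}


def estimate_file_gb(distill: str, quant: str) -> float:
    """Scale the Q4_K_M file size by the rule-of-thumb multiplier."""
    return Q4_FILE_GB[distill] * MULTIPLIER[quant]


if __name__ == "__main__":
    for quant in MULTIPLIER:
        print(f"32B at {quant}: ~{estimate_file_gb('32B', quant):.0f} GB")
```

For the 32B this gives about 76 GB at FP16, above the ~64 GB listed in the table, so the multipliers are best read as a conservative ceiling rather than an exact prediction.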
Updated: 2026-06-19
Key Takeaways
- Q4_K_M (Ollama default) VRAM: 1.5B ~4 GB, 7B ~5.5 GB, 8B ~6 GB, 14B ~9.5 GB, 32B ~20.5 GB, 70B ~42 GB.
- Q8_0 is roughly 2× the Q4_K_M size; FP16 is roughly 4× the Q4_K_M file size.
- The 14B at Q4_K_M (~9.5 GB) is the sweet spot: it fits a 16 GB card with context headroom.
- The 32B at Q4_K_M (~20.5 GB) is tight on a 24 GB RTX 4090; drop to a smaller quant for longer context.
- The full 671B DeepSeek-R1 is not on this table; it needs ~376–404 GB at Q4 (datacenter only).
- These are R1 reasoning distills, not DeepSeek-V3 (a chat model).
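The 671B figure can be sanity-checked with back-of-the-envelope arithmetic: weight size is roughly parameter count times bits per weight, divided by 8. The sketch below assumes Q4-class quants land at about 4.5 to 4.8 effective bits per weight; that range is an assumption chosen for illustration, not a number from this page, and `weights_gb` is a hypothetical helper.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes), ignoring KV cache and overhead."""
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    # Assumed effective bits per weight for Q4-class quants (illustrative only).
    low, high = 4.5, 4.8
    print(f"671B at ~{low} bpw: ~{weights_gb(671, low):.0f} GB")
    print(f"671B at ~{high} bpw: ~{weights_gb(671, high):.0f} GB")
```

Under those assumptions the result falls inside the ~376–404 GB range quoted above, before any allowance for context.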
DeepSeek-R1 Distill VRAM by Quantization
VRAM figures include a small allowance for context and KV cache on top of the raw file size. Q4_K_M is the Ollama default and the best size-to-quality trade-off for reasoning. Use Q8_0 only if you have spare VRAM and want a marginal quality bump; FP16 is rarely worth it locally.
| Distill | Q4_K_M (VRAM) | Q8_0 (VRAM) | FP16 (VRAM) | Min GPU (Q4_K_M) |
|---|---|---|---|---|
| 1.5B | ~4 GB | ~5 GB | ~6 GB | Any 4 GB GPU / CPU |
| 7B (Qwen2.5) | ~5.5 GB | ~9.5 GB | ~16 GB | RTX 3060 12GB |
| 8B (Llama 3.1) | ~6 GB | ~10 GB | ~17 GB | RTX 3060 12GB |
| 14B (Qwen2.5) | ~9.5 GB | ~16 GB | ~29 GB | RTX 4060 Ti 16GB |
| 32B (Qwen2.5) | ~20.5 GB | ~35 GB | ~64 GB | RTX 4090 24GB (tight) |
| 70B (Llama 3.3) | ~42 GB | ~74 GB | ~140 GB | Dual-GPU / 48 GB |
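One way to use the table is as a lookup: given a VRAM budget, find the largest distill that fits at a chosen quantization. The following is a minimal sketch that copies the table's VRAM figures verbatim; `VRAM_GB`, `largest_fit`, and the `headroom_gb` parameter (extra room reserved for longer context) are hypothetical names introduced for this example.

```python
from typing import Optional

# VRAM figures (GB) copied from the table, ordered smallest to largest distill.
VRAM_GB = {
    "1.5B": {"Q4_K_M": 4.0, "Q8_0": 5.0, "FP16": 6.0},
    "7B": {"Q4_K_M": 5.5, "Q8_0": 9.5, "FP16": 16.0},
    "8B": {"Q4_K_M": 6.0, "Q8_0": 10.0, "FP16": 17.0},
    "14B": {"Q4_K_M": 9.5, "Q8_0": 16.0, "FP16": 29.0},
    "32B": {"Q4_K_M": 20.5, "Q8_0": 35.0, "FP16": 64.0},
    "70B": {"Q4_K_M": 42.0, "Q8_0": 74.0, "FP16": 140.0},
}


def largest_fit(vram_gb: float, quant: str = "Q4_K_M",
                headroom_gb: float = 0.0) -> Optional[str]:
    """Return the largest distill whose table VRAM plus headroom fits the budget."""
    fits = [name for name, row in VRAM_GB.items()
            if row[quant] + headroom_gb <= vram_gb]
    # Dict order is smallest to largest, so the last match is the largest fit.
    return fits[-1] if fits else None


if __name__ == "__main__":
    print(largest_fit(24))                   # 24 GB card, no extra headroom
    print(largest_fit(24, headroom_gb=4.0))  # same card, 4 GB kept for context
```

Reserving a few GB of headroom on a 24 GB card moves the answer from the 32B down to the 14B, which matches the "tight" label on the 32B row.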
Which Quantization Should You Pick?
**Use Q4_K_M for almost everything** — it is the Ollama default and keeps reasoning quality high while fitting the most models per GB. Pick it unless you have a specific reason not to.
**Use Q8_0 only with spare VRAM** — it roughly doubles the footprint for a marginal quality gain that rarely changes a reasoning answer. Worth it on a 24 GB card running the 14B, not much else.
**Skip FP16 locally** — at roughly 4× the Q4_K_M size it pushes the 32B to 64 GB-class hardware for no practical reasoning benefit over Q8_0.
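The three recommendations above can be read as a small decision rule. The sketch below is one possible encoding, not a definitive policy: `pick_quant` is a hypothetical helper, and the 4 GB `spare_gb` threshold for "spare VRAM" is an assumption made for illustration. FP16 is deliberately never returned, following the advice to skip it locally.

```python
from typing import Optional


def pick_quant(q4_vram_gb: float, q8_vram_gb: float, card_gb: float,
               spare_gb: float = 4.0) -> Optional[str]:
    """Pick a quantization for one distill on one card.

    Prefers Q8_0 only when it fits with spare_gb left over, otherwise
    falls back to Q4_K_M, and returns None when even Q4_K_M does not fit.
    """
    if q8_vram_gb + spare_gb <= card_gb:
        return "Q8_0"
    if q4_vram_gb <= card_gb:
        return "Q4_K_M"
    return None


if __name__ == "__main__":
    # Figures taken from the table: 14B is ~9.5 GB (Q4_K_M) and ~16 GB (Q8_0).
    print(pick_quant(9.5, 16.0, card_gb=24))   # 14B on a 24 GB card
    print(pick_quant(20.5, 35.0, card_gb=24))  # 32B on a 24 GB card
```

With the table's numbers, the 14B on a 24 GB card comes out as Q8_0 and the 32B on the same card as Q4_K_M, in line with the guidance above.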
V3 vs R1: Don't Confuse Them
**DeepSeek-V3 is a chat model; DeepSeek-R1 (and these distills) are reasoning models.** This table is for the R1 reasoning family only. If you are looking for V3, it is a 671B MoE chat model that is also not consumer-runnable — see the [DeepSeek V3 hardware bite](/prompt-bites/deepseek-v3-local-hardware-requirements).
Related Guides
- Best DeepSeek Distill for Your GPU: match your card to a distill, plus the Ollama command and expected tok/s
- Best Local Reasoning Model 2026: DeepSeek-R1 Ranked (the full ranked guide with benchmarks)
- DeepSeek V3 Local Hardware Requirements: the V3 chat-model counterpart
Frequently Asked Questions
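What is the VRAM for DeepSeek-R1-Distill-Qwen-32B?
About 20.5 GB at Q4_K_M, about 35 GB at Q8_0, and about 64 GB at FP16.
How much does Q8_0 add over Q4_K_M?
It roughly doubles the footprint; for the 14B, that is about 16 GB instead of about 9.5 GB.
Can I run the 70B distill on one GPU?
Only on a 48 GB-class card. At Q4_K_M it needs about 42 GB, so consumer setups need two GPUs.
Is the full DeepSeek-R1 on this table?
No. The full 671B model needs roughly 376–404 GB at Q4 and is datacenter only.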