DeepSeek-R1 Distill VRAM Cheatsheet (2026)

This page contains links to third-party products for reference. PromptQuorum is not enrolled in any affiliate program — these are plain links that earn no commission. Clicking links and your next steps are entirely your own responsibility. These links do not represent any endorsement or verification by PromptQuorum.

Quick Answer

At Q4_K_M (Ollama default): 1.5B ≈ 4 GB, 7B ≈ 5.5 GB, 8B ≈ 6 GB, 14B ≈ 9.5 GB, 32B ≈ 20.5 GB, 70B ≈ 42 GB. Q8_0 is about 2× the Q4_K_M size and FP16 about 4×, so the 32B at FP16 needs a 64 GB-class setup.

  • 1.5B: ~1.1 GB file, ~4 GB VRAM (or CPU) at Q4_K_M
  • 7B: ~4.7 GB file, ~5.5 GB VRAM — RTX 3060 12GB
  • 14B: ~9 GB file, ~9.5 GB VRAM — RTX 4060 Ti 16GB
  • 32B: ~19 GB file, ~20.5 GB VRAM — RTX 4090 24GB (tight)
  • 70B: ~40 GB file, ~42 GB VRAM — dual-GPU or 48 GB
  • Rule: Q8_0 ≈ 2× Q4_K_M; FP16 ≈ 4× Q4_K_M

Updated: 2026-06-19

Quantization & VRAM · Intermediate

Key Takeaways

  • Q4_K_M (Ollama default) VRAM: 1.5B ~4 GB, 7B ~5.5 GB, 8B ~6 GB, 14B ~9.5 GB, 32B ~20.5 GB, 70B ~42 GB.
  • Q8_0 is roughly 2× the Q4_K_M file size; FP16 is roughly 4×.
  • The 14B at Q4_K_M (~9.5 GB) is the sweet spot — fits a 16 GB card with context headroom.
  • The 32B at Q4_K_M (~20.5 GB) is tight on a 24 GB RTX 4090; drop to a smaller quant for longer context.
  • The full 671B DeepSeek-R1 is not on this table — it needs ~376–404 GB at Q4 (datacenter only).
  • These are R1 reasoning distills, not DeepSeek-V3 (a chat model).
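
The 2× / 4× rule of thumb is plain multiplication, so it can be sketched in a few lines. This is a minimal illustration, not a measurement: the function name and constants are hypothetical, and the multipliers are the rough ones quoted on this page. The per-model VRAM figures in the table come in somewhat lower than these estimates, so treat the result as a ceiling.

```python
# Rule-of-thumb multipliers quoted on this page (rough, not measured).
Q8_0_FACTOR = 2.0
FP16_FACTOR = 4.0


def estimate_file_sizes_gb(q4_k_m_file_gb: float) -> dict:
    """Scale a Q4_K_M file size (GB) to rough Q8_0 and FP16 file sizes."""
    return {
        "Q4_K_M": q4_k_m_file_gb,
        "Q8_0": q4_k_m_file_gb * Q8_0_FACTOR,
        "FP16": q4_k_m_file_gb * FP16_FACTOR,
    }


# Example: the 14B distill has a ~9 GB Q4_K_M file.
print(estimate_file_sizes_gb(9.0))
```

For the 14B this gives about 18 GB at Q8_0 and 36 GB at FP16, against ~16 GB and ~29 GB in the table.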

DeepSeek-R1 Distill VRAM by Quantization

VRAM figures include a small allowance for context and KV cache on top of the raw file size. Q4_K_M is the Ollama default and the best size-to-quality trade-off for reasoning. Use Q8_0 only if you have spare VRAM and want a marginal quality bump; FP16 is rarely worth it locally.

| Distill | Q4_K_M (VRAM) | Q8_0 (VRAM) | FP16 (VRAM) | Min GPU (Q4_K_M) |
|---|---|---|---|---|
| 1.5B | ~4 GB | ~5 GB | ~6 GB | Any 4 GB GPU / CPU |
| 7B (Qwen2.5) | ~5.5 GB | ~9.5 GB | ~16 GB | RTX 3060 12GB |
| 8B (Llama 3) | ~6 GB | ~10 GB | ~17 GB | RTX 3060 12GB |
| 14B (Qwen2.5) | ~9.5 GB | ~16 GB | ~29 GB | RTX 4060 Ti 16GB |
| 32B (Qwen2.5) | ~20.5 GB | ~35 GB | ~64 GB | RTX 4090 24GB (tight) |
| 70B (Llama 3) | ~42 GB | ~74 GB | ~140 GB | Dual-GPU / 48 GB |
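
To check a card against the table quickly, the figures can be dropped into a lookup. The sketch below copies the VRAM values from the table above; the names `VRAM_GB` and `largest_distill_that_fits` are hypothetical helpers for illustration, and an exact fit still leaves little room for long context.

```python
from typing import Optional

# VRAM in GB, copied from the table above: (Q4_K_M, Q8_0, FP16).
VRAM_GB = {
    "1.5B": (4.0, 5.0, 6.0),
    "7B": (5.5, 9.5, 16.0),
    "8B": (6.0, 10.0, 17.0),
    "14B": (9.5, 16.0, 29.0),
    "32B": (20.5, 35.0, 64.0),
    "70B": (42.0, 74.0, 140.0),
}
QUANTS = ("Q4_K_M", "Q8_0", "FP16")


def largest_distill_that_fits(vram_gb: float, quant: str = "Q4_K_M") -> Optional[str]:
    """Return the largest distill whose table figure fits in vram_gb, or None."""
    col = QUANTS.index(quant)  # raises ValueError for an unknown quant name
    fitting = [name for name, row in VRAM_GB.items() if row[col] <= vram_gb]
    # The dict is ordered smallest to largest, so the last match is the largest.
    return fitting[-1] if fitting else None


print(largest_distill_that_fits(24))          # 24 GB card at Q4_K_M
print(largest_distill_that_fits(24, "Q8_0"))  # same card at Q8_0
```

On a 24 GB card this picks the 32B at Q4_K_M and the 14B at Q8_0, matching the guidance on this page.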

Which Quantization Should You Pick?

**Use Q4_K_M for almost everything** — it is the Ollama default and keeps reasoning quality high while fitting the most models per GB. Pick it unless you have a specific reason not to.

**Use Q8_0 only with spare VRAM** — it roughly doubles the footprint for a marginal quality gain that rarely changes a reasoning answer. Worth it on a 24 GB card running the 14B, not much else.

**Skip FP16 locally** — at roughly 4× the Q4_K_M size it pushes the 32B to 64 GB-class hardware for no practical reasoning benefit over Q8_0.
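
The three rules above reduce to a short decision. The following is one possible sketch: `recommend_quant` and its `headroom_gb` parameter are hypothetical, and the 4 GB default for "spare VRAM" is an assumption, not a figure from this page. FP16 is never returned, in line with the advice to skip it locally.

```python
from typing import Optional


def recommend_quant(
    vram_gb: float,
    q4_k_m_gb: float,
    q8_0_gb: float,
    headroom_gb: float = 4.0,  # assumed meaning of "spare VRAM"
) -> Optional[str]:
    """Pick a quant for one distill, given its table figures and the card size."""
    # Q8_0 only when it fits with room to spare.
    if q8_0_gb + headroom_gb <= vram_gb:
        return "Q8_0"
    # Otherwise the default, if it fits at all.
    if q4_k_m_gb <= vram_gb:
        return "Q4_K_M"
    # The model does not fit this card at either quant.
    return None


print(recommend_quant(24, 9.5, 16))   # 14B on a 24 GB card
print(recommend_quant(24, 20.5, 35))  # 32B on a 24 GB card
```

With these defaults the 14B on a 24 GB card gets Q8_0 and the 32B on the same card stays at Q4_K_M.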

V3 vs R1: Don't Confuse Them

**DeepSeek-V3 is a chat model; DeepSeek-R1 (and these distills) are reasoning models.** This table is for the R1 reasoning family only. If you are looking for V3, it is a 671B MoE chat model that is also not consumer-runnable — see the [DeepSeek V3 hardware bite](/prompt-bites/deepseek-v3-local-hardware-requirements).

Frequently Asked Questions

What is the VRAM for DeepSeek-R1-Distill-Qwen-32B?
About 20.5 GB at Q4_K_M, which fits a 24 GB RTX 4090 but leaves little room for long context. At Q8_0 it needs ~35 GB and at FP16 ~64 GB.
How much does Q8_0 add over Q4_K_M?
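About 1.7× the VRAM for the 7B and larger distills in the table above (the 32B goes from ~20.5 GB to ~35 GB), a little under the 2× file-size rule of thumb. For most reasoning tasks the quality gain is marginal, so Q4_K_M is the better default unless you have spare VRAM.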
Can I run the 70B distill on one GPU?
Not on a single consumer card. At ~42 GB (Q4_K_M) it exceeds the VRAM of any of them. Use two 24 GB GPUs or one 48 GB workstation card.
Is the full DeepSeek-R1 on this table?
No. The full 671B R1 needs ~376–404 GB at Q4 and is datacenter-only. This cheatsheet covers the consumer-runnable distills (1.5B–70B).