PromptQuorum

Best DeepSeek Distill for Your GPU (2026)

This page contains links to third-party products for reference. PromptQuorum is not enrolled in any affiliate program — these are plain links that earn no commission. Clicking links and your next steps are entirely your own responsibility. These links do not represent any endorsement or verification by PromptQuorum.

Quick Answer

Find your card: RTX 3060 12GB → 7B, RTX 4060 Ti 16GB → 14B, RTX 4070/4080 → 14B or 32B, RTX 4090 → 32B, dual-GPU/48 GB → 70B. For the best small model on 8 GB, run DeepSeek-R1-0528-Qwen3-8B. Each runs with one Ollama command at Q4_K_M.

  • RTX 3060 12GB → deepseek-r1:7b — ~30–40 tok/s
  • RTX 4060 Ti 16GB → deepseek-r1:14b — ~25–35 tok/s (recommended)
  • RTX 4070 / 4080 → deepseek-r1:14b or :32b — 14B ~40–50, 32B ~15–20 tok/s
  • RTX 4090 24GB → deepseek-r1:32b — ~30–40 tok/s, beats o1-mini
  • Dual-GPU / 48 GB → deepseek-r1:70b — ~12–18 tok/s
  • 8 GB card, best small → DeepSeek-R1-0528-Qwen3-8B

Updated: 2026-06-19

Quantization & VRAM · Intermediate

Key Takeaways

  • RTX 3060 12GB → 7B distill; RTX 4060 Ti 16GB → 14B (the sweet spot); RTX 4090 → 32B (beats o1-mini on several reasoning benchmarks).
  • Dual-GPU or 48 GB → 70B distill, the strongest of the six.
  • On 8 GB, the best small model is DeepSeek-R1-0528-Qwen3-8B.
  • Every model installs at Q4_K_M with one command, e.g. `ollama run deepseek-r1:14b`.
  • Set temperature to 0.6 and use no system prompt to avoid R1 repetition failures.
  • This is the R1 reasoning family — not DeepSeek-V3, which is a chat model.

GPU → DeepSeek-R1 Distill → Ollama Command

Find the GPU you own in the first column and read across. The tok/s figures are approximate for Q4_K_M reasoning workloads and vary with context length and sampling settings. When two models fit, the larger one reasons better; the smaller one is faster.

| GPU (VRAM) | Best Distill | Ollama Command | Expected tok/s |
| --- | --- | --- | --- |
| RTX 3060 12GB (8 GB tier) | DeepSeek-R1-Distill-Qwen-7B | `ollama run deepseek-r1:7b` | ~30–40 |
| 8 GB, best small | DeepSeek-R1-0528-Qwen3-8B | `ollama run deepseek-r1-0528-qwen3:8b` | ~30–40 |
| RTX 4060 Ti 16GB | DeepSeek-R1-Distill-Qwen-14B | `ollama run deepseek-r1:14b` | ~25–35 |
| RTX 4070 / 4080 | 14B (fast) or 32B (if 16 GB+) | `ollama run deepseek-r1:14b` | 14B ~40–50 |
| RTX 4090 24GB | DeepSeek-R1-Distill-Qwen-32B | `ollama run deepseek-r1:32b` | ~30–40 |
| Dual-GPU / 48 GB | DeepSeek-R1-Distill-Llama-70B | `ollama run deepseek-r1:70b` | ~12–18 |
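
The table can also be read as a simple lookup. The sketch below is one way to encode it, not an official tool: the `Tier` type, the `pick_distill` and `estimate_seconds` helpers, and the VRAM thresholds (8, 16, 24, 48 GB) are assumptions made for illustration, while the tags and tok/s ranges are copied from the rows above.

```python
from typing import NamedTuple, Optional, Tuple


class Tier(NamedTuple):
    min_vram_gb: int            # assumed threshold, inferred from the table rows
    tag: str                    # Ollama tag from the table
    tok_s: Tuple[int, int]      # approximate (low, high) tok/s from the table


# Ordered from largest to smallest so the first match is the biggest model that fits.
# On an 8 GB card the table also lists DeepSeek-R1-0528-Qwen3-8B as an alternative.
TIERS = [
    Tier(48, "deepseek-r1:70b", (12, 18)),
    Tier(24, "deepseek-r1:32b", (30, 40)),
    Tier(16, "deepseek-r1:14b", (25, 35)),
    Tier(8, "deepseek-r1:7b", (30, 40)),
]


def pick_distill(vram_gb: float) -> Optional[Tier]:
    """Return the largest tier whose assumed threshold fits in vram_gb."""
    for tier in TIERS:
        if vram_gb >= tier.min_vram_gb:
            return tier
    return None  # below 8 GB: nothing in the table applies


def estimate_seconds(tokens: int, tok_s: Tuple[int, int]) -> Tuple[float, float]:
    """Return (fastest, slowest) generation time for a given token count."""
    low, high = tok_s
    return tokens / high, tokens / low
```

With these figures, a 1,200-token reasoning trace on a 24 GB card would take roughly 30 to 40 seconds, which is the kind of estimate that helps when deciding between a larger and a faster model.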

How to Use This Table in 3 Steps

Three steps: (1) find your GPU and its VRAM, (2) run the matching Ollama command, (3) set the temperature to 0.6 and clear the system prompt. If a model is too slow, drop one tier; if you have spare VRAM, move up a tier for better reasoning.
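
For scripted use, step 3 can be applied per request instead of in the interactive prompt. The following is a minimal sketch that assumes a local Ollama server on its default port (11434) and uses its `/api/generate` endpoint; the helper names `build_request` and `generate` are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, temperature: float = 0.6) -> dict:
    """Build a request body with temperature 0.6 and no system prompt."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a token stream
        "options": {"temperature": temperature},
        # No "system" key: the request itself adds no system prompt.
    }


def generate(model: str, prompt: str) -> str:
    """Send the request to the local server and return the generated text."""
    body = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("deepseek-r1:14b", "What is 17 * 23?")` would then send a single non-streaming request with the recommended settings, provided the model has already been pulled.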

V3 vs R1: This Table Is R1 Only

**DeepSeek-R1 is the reasoning family these commands install; DeepSeek-V3 is a separate chat model.** Do not expect a V3 experience from these distills — they are tuned to show step-by-step reasoning for math and logic. V3 is also a 671B MoE and not consumer-runnable; see the [DeepSeek V3 hardware bite](/prompt-bites/deepseek-v3-local-hardware-requirements).
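
Because these distills print their reasoning before the answer, it can help to separate the two in scripts. The sketch below assumes the raw output wraps the reasoning in `<think>...</think>` tags, as R1-style models commonly do; some Ollama versions may return the reasoning in a separate field instead, and the `split_reasoning` helper is hypothetical.

```python
import re
from typing import Tuple

# Assumption: reasoning is delimited by <think>...</think> in the raw text.
_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no tags are found."""
    match = _THINK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # The answer is everything outside the first <think> block.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer
```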

Frequently Asked Questions

What DeepSeek distill runs on an RTX 4090?
DeepSeek-R1-Distill-Qwen-32B. At Q4_K_M it needs ~20.5 GB, fits a 24 GB RTX 4090 (tight on context), and beats OpenAI o1-mini on several reasoning benchmarks. Command: `ollama run deepseek-r1:32b`.

What is the best DeepSeek distill for an 8 GB GPU?
DeepSeek-R1-0528-Qwen3-8B is the strongest small reasoning distill and fits 8 GB. The original 7B distill (`ollama run deepseek-r1:7b`) is the well-supported alternative.

Why is my distill slow?
Usually VRAM overflow: if the model does not fit, it spills to system RAM and throughput collapses. Drop one tier (e.g., 32B → 14B) so the model fits entirely in VRAM.

Do I need to pick a quantization?
No. The `ollama run deepseek-r1:` commands default to Q4_K_M, the best size-to-quality trade-off. See the VRAM cheatsheet if you want Q8_0 or FP16 figures.
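
The VRAM-overflow answer above comes down to simple arithmetic, sketched here as a rough check. The ~20.5 GB figure for the 32B distill at Q4_K_M is from this page; the 2 GB reserve for context and runtime overhead is an assumed placeholder, and both helper names are hypothetical.

```python
def headroom_gb(model_gb: float, vram_gb: float) -> float:
    """VRAM left over after loading the model weights."""
    return vram_gb - model_gb


def fits_in_vram(model_gb: float, vram_gb: float, reserve_gb: float = 2.0) -> bool:
    """True if the weights plus an assumed reserve fit entirely in VRAM.

    reserve_gb is an illustrative allowance for context and overhead;
    real usage grows with context length.
    """
    return model_gb + reserve_gb <= vram_gb


# 32B at Q4_K_M (~20.5 GB, per this page) on a 24 GB card versus a 16 GB card.
fits_24 = fits_in_vram(20.5, 24.0)   # fits, with 3.5 GB of headroom
fits_16 = fits_in_vram(20.5, 16.0)   # does not fit: drop one tier to 14B
```

The 3.5 GB of headroom on a 24 GB card is why the 32B distill is described as tight on context.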