Key Takeaways
- 7B models: 8GB minimum (Q4), 10GB comfortable (Q5), 14GB for Q8 full precision.
- 13B models: 10GB minimum (Q4), 12-14GB comfortable (Q5), 16GB for Q8.
- 70B models: 24GB minimum (Q4), 32GB+ for Q5/Q8 or multi-user setup.
- Quantization (Q4, Q5, Q8) reduces VRAM by 50-75% vs. full precision (FP32).
- Always over-allocate by 1-2GB for overhead (KV cache, optimizer state, system OS).
- Batch size β VRAM per inference. Single inference uses same VRAM regardless of batch (batch processes sequentially).
- More VRAM doesn't speed up single-prompt inference. It only helps with multi-user/multi-request setups.
VRAM Rule of Thumb β Quick Reference
Don't have time for the formula? Use these simple rules:
- 3B models (Phi, StableLM): 4 GB VRAM minimum
- 7B models (Llama, Mistral, Qwen): 8 GB VRAM (Q4), 10 GB (Q5)
- 13B models (Llama 3.1, Mistral): 12 GB VRAM minimum (Q4)
- 22B models (Qwen2.5, Gemma): 16 GB VRAM (Q4)
- 70B models (Llama 3.3, Qwen 3.6): 24β32 GB VRAM (Q4βQ5)
- MoE models: Use active parameters only. Example: Llama 4 Scout has 17B active = ~9 GB VRAM, not 44 GB
# Quick VRAM formula (memorize this)
VRAM (GB) β Model Size (B) Γ· 8 # at Q4 quantization
# Examples:
7B Γ· 8 = 0.875 GB per billion β 8 GB total
70B Γ· 8 = 8.75 GB per billion β 48 GB total
# For other quantizations:
Q8 (8-bit): Model Size Γ· 4
Q5 (5-bit): Model Size Γ· 5
FP32 (full): Model Size Γ 4What Is the VRAM Formula for LLMs?
VRAM (GB) = (Model Size in Billions Γ 4 bytes Γ Quantization Factor)
- Model size: Number of parameters (7B, 13B, 70B, etc.)
- 4 bytes: FP32 precision (1 byte = 8 bits)
- Quantization factor: 1.0 (FP32), 0.5 (Q8), 0.25 (Q4)
Example: Llama 3 70B, FP32, no quantization:
70 billion Γ 4 bytes = 280GB. Impractical.
Llama 3 70B, Q4 (4-bit) quantization:
70 billion Γ 4 bytes Γ 0.25 = 70GB allocated, ~24GB used after compression.
MoE Models (Sparse): Use active parameter count only. Example: Llama 4 Scout has 109B total parameters but only 17B active at once. VRAM = 17B Γ 0.5 bytes (Q4) β 9 GB β not the 44 GB a naive total-parameter calculation would suggest.
How Much VRAM Does Each Model Size Need?
| Model Size | FP32 (No Quantization) | Q8 (8-bit) | Q5 (5-bit) | Q4 (4-bit) | Recommended GPU |
|---|---|---|---|---|---|
| 3B (Phi, StableLM) | 12 GB | 6 GB | 4 GB | 3 GB | RTX 2060 6 GB or RTX 5070 12 GB |
| 7B (Llama 2, Mistral) | 28 GB | 14 GB | 9 GB | 7 GB | RTX 3060 12 GB or RTX 5070 12 GB |
| 13B (Llama 2, Mistral) | 52 GB | 26 GB | 17 GB | 13 GB | RTX 3090 24 GB or RTX 5080 16 GB |
| 22B (Qwen, Gemma) | 88 GB | 44 GB | 28 GB | 22 GB | RTX 4090 24 GB (Q4) or RTX 5090 32 GB |
| 70B (Llama 3, Qwen) | 280 GB | 140 GB | 88 GB | 70 GB | 2Γ RTX 4090 (24 GB each), or 1Γ H100 80 GB |
| Qwen 3.6 35B-A3B (3B active, MoE)* | 12 GB | 3 GB | 2 GB | 2 GB | RTX 2060 6 GB or RTX 5070 12 GB |
| DeepSeek V4-Flash (13B active / 284B total, MoE)* | 52 GB | 13 GB | 8 GB | 7 GB | RTX 3060 12 GB or RTX 5070 12 GB |
| Llama 4 Scout (17B active / 109B total, MoE)* | 68 GB | 17 GB | 11 GB | 9 GB | RTX 3090 24 GB or RTX 5080 16 GB |
| Kimi K2.6 (42B active / 1T total, MoE)* | 168 GB | 42 GB | 27 GB | 21 GB | 2Γ RTX 4090 or RTX 5090 32 GB (Q4 only) |
* MoE models: VRAM is calculated from active parameters only, not total model size.
MoE Models Need Far Less VRAM Than Their Size Suggests
Mixture-of-Experts (MoE) models split their parameters across many "expert" sub-networks and activate only a fraction for each token. VRAM depends on active parameters β the subset loaded during inference β not total parameters.
Dense model rule: VRAM = total_params Γ bytes_per_param
MoE model rule: VRAM = active_params Γ bytes_per_param
Example: Llama 4 Scout has 109B total parameters but only 17B activate per token. At Q4 quantization: 17B Γ 0.5 bytes β 9 GB β runnable on a single RTX 3090, versus the ~55 GB a dense 109B model would need.
Qwen 3.6 35B-A3B is even more extreme: 3B active out of 35B total gives it the VRAM footprint of a small 3B dense model while delivering 35B-class output quality.
How Does Quantization Reduce VRAM Requirements?
Quantization reduces the number of bits needed to represent each model parameter.
- FP32 (32-bit float): Full precision. 1 parameter = 4 bytes. No loss. Slowest.
- Q8 (8-bit): 1 parameter = 1 byte. ~6% accuracy loss. 75% VRAM savings.
- Q5 (5-bit): 1 parameter = 0.625 bytes. ~2% accuracy loss. 84% VRAM savings.
- Q4 (4-bit): 1 parameter = 0.5 bytes. ~1% accuracy loss. 87.5% VRAM savings.
For most users, Q4 is the sweet spot: imperceptible accuracy loss, 87% smaller VRAM footprint.
As of April 2026, Q4 is standard. Q5 and Q8 are available if you have extra VRAM and want marginal quality gains.
VRAM determines model size, but prompt design determines output quality. Techniques like chain-of-thought and few-shot prompting can close the quality gap between smaller and larger models. Explore the full prompt engineering toolkit to get more from the models your hardware supports. If you have 12β16 GB VRAM and want a concrete coding workload to put that toolkit against, Replace GitHub Copilot With a Local LLM maps the Continue.dev + Ollama + Qwen3-Coder stack onto exactly those VRAM tiers.
What About Batch Size and Multi-User Inference?
Batch size affects throughput (tokens per second), not single-inference latency.
A single user prompting "What is 2+2?" uses the same VRAM whether batch size is 1 or 32.
Batch size = 32 means processing 32 prompts in parallel. This uses ~32Γ more VRAM, but generates 32 responses faster.
For single-user (typical local LLM usage): Batch size = 1. VRAM is model size + 1-2GB overhead.
For multi-user server: Allocate batch size Γ model VRAM. A 70B model at batch=4 needs ~96GB (24GB Γ 4).
Do You Need More VRAM Than the Model Size?
Yes. Beyond the model weights, add:
- KV cache (key-value cache for context): ~5-10% extra VRAM.
- Optimizer state (if fine-tuning): 2-4Γ model size (only relevant for training, not inference).
- System overhead (OS, drivers, Ollama/LM Studio runtime): ~1-2GB.
Rule: A 70B model Q4 (20GB) + KV cache (2GB) + system (2GB) = ~24GB allocated.
Always buy GPUs with at least 1-2GB headroom above theoretical minimums.
Common VRAM Misconceptions
- More VRAM = faster inference. False. VRAM size doesn't affect speed. Memory bandwidth (GB/sec) does, and that's fixed per GPU.
- Batch size = sequential token limit. False. Batch size = parallel requests. Single inference uses batch=1 regardless of VRAM size.
- You need 24GB for any 70B model. False. Q4 needs 24GB. Q8 needs 48GB. Depends on quantization.
VRAM Calculator
Select your model size and quantization to estimate VRAM requirements.
Popular Models
Base Model
6.50 GB
Context OH
1.50 GB
Batch OH
0.00 GB
System OH
1.00 GB
Total Minimum
9.00 GB
Recommended (with 25% safety margin)
11.25 GB
π Look for a GPU with at least 11.25 GB VRAM
Compatible GPUs
RTX 3060 (12 GB)
0.8 GB headroom
RTX 4070 (12 GB)
0.8 GB headroom
RTX 4070 Ti (12 GB)
0.8 GB headroom
RTX 4080 (16 GB)
4.8 GB headroom
RTX 4090 (24 GB)
12.8 GB headroom
Mac mini M5 (16 GB) (16 GB)
4.8 GB headroom
Mac mini M4 (16 GB) (16 GB)
4.8 GB headroom
MacBook Pro (24 GB) (24 GB)
12.8 GB headroom
M3 Max (36 GB) (36 GB)
24.8 GB headroom
π‘ Pro Tips:
- Always use the "with safety margin" figure when buying a GPU
- Q4 gives 90-95% quality with 25% size reduction. Q5 is better if you have room
- Context overhead grows with conversation length. Budget 1-3 GB for typical usage
- Batch size matters for multi-user APIs. Single-user chat can ignore batch overhead
π Share this configuration:
FAQ
Can I run Mistral 7B on a 6GB GPU?
Barely, at Q4 with tight overhead. Practically, no. Buy at least 8GB. You'll hit OOM errors with 6GB.
How much VRAM do I need for fine-tuning a 7B model?
For LoRA: 12-16GB. Full fine-tuning: 28GB+. Fine-tuning requires optimizer state (2-4Γ model VRAM), not just inference.
Is 12GB enough for Llama 3 13B?
At Q4, yes barely. At Q5 or Q8, no. 12GB is cutting it close. 16GB is comfortable.
Do I need 24GB for a 70B model?
At Q4, yes. At Q5+, no. Higher quantization (Q5, Q8) need 32GB+ for 70B.
Does increasing batch size reduce VRAM for single inference?
No. Single inference always uses batch=1 VRAM. Batch size only helps throughput (multi-user scenarios).
What's the best quantization for accuracy?
Q8 is nearly imperceptible loss. Q5 is ~2% loss. Q4 is ~1% loss. For most, Q4 is the sweet spot.
Can I offload some VRAM to CPU RAM?
Yes, via layer-splitting (NVLink). Llama.cpp and Ollama support this. Performance drops 30-50% but it works. Under 8 GB VRAM? See **which models run fastest on your exact hardware tier** β CPU-only, 4 GB, 6 GB, and 8 GB VRAM benchmarks with real tok/sec numbers.
Sources
- NVIDIA CUDA memory architecture and shared memory model documentation
- Ollama and LM Studio official documentation: model VRAM requirements and quantization specs
- llama.cpp project GitHub: quantization levels (Q4, Q5, Q8) and memory calculations