Key Takeaways
- 7B models: 8GB minimum (Q4), 10GB comfortable (Q5), ~12GB for Q8 (FP16 needs ~14GB).
- 13B models: 10GB minimum (Q4), 12–14GB comfortable (Q5), 16GB for Q8.
- 70B models: 40–48GB for Q4 (e.g., 2×24GB GPUs or one 48GB card), 48GB+ for Q5, ~80GB for Q8 or multi-user setups.
- Quantization (Q4, Q5, Q8) reduces VRAM by 75–87.5% vs. full precision (FP32).
- Always allow 1–2GB of headroom for overhead (KV cache, runtime, OS/drivers).
- Batch size scales VRAM: the weights are shared, but the KV cache grows with each concurrent request. A single request uses the same VRAM regardless of the server's maximum batch size.
- More VRAM doesn't speed up single-prompt inference. It only helps with multi-user/multi-request setups.
What Is the VRAM Formula for LLMs?
VRAM (GB) = Model Size in Billions × 4 bytes × Quantization Factor
- Model size: Number of parameters (7B, 13B, 70B, etc.)
- 4 bytes: FP32 precision (32 bits = 4 bytes)
- Quantization factor: the quantized bit width divided by 32 — 1.0 (FP32), 0.25 (Q8), 0.15625 (Q5), 0.125 (Q4)
Example: Llama 3 70B, FP32, no quantization:
70 billion × 4 bytes = 280GB. Impractical.
Llama 3 70B, Q4 (4-bit) quantization:
70 billion × 4 bytes × 0.125 = 35GB of weights; with KV cache and runtime overhead, plan for roughly 40GB.
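The formula can be sketched as a small Python helper. This is a rough estimate of weight memory only; real quantized files add a little metadata overhead on top:

```python
def weight_vram_gb(params_billions: float, bits_per_param: float) -> float:
    """Estimate VRAM for the model weights alone.

    params_billions * 1e9 parameters, each stored at bits_per_param bits;
    divide by 8 to get bytes, then by 1e9 to get (decimal) gigabytes.
    """
    return params_billions * 1e9 * (bits_per_param / 8) / 1e9

# Llama 3 70B, weights only -- no KV cache or runtime overhead:
print(weight_vram_gb(70, 32))  # FP32 -> 280.0 GB
print(weight_vram_gb(70, 4))   # Q4   -> 35.0 GB
```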
How Much VRAM Does Each Model Size Need?
| Model Size | FP32 (No Quantization) | Q8 (8-bit) | Q5 (5-bit) | Q4 (4-bit) | Recommended GPU |
|---|---|---|---|---|---|
| 7B | 28GB | ~7GB | ~4.4GB | ~3.5GB | 8GB+ (e.g., RTX 3060 12GB) |
| 13B | 52GB | ~13GB | ~8.1GB | ~6.5GB | 12–16GB |
| 70B | 280GB | ~70GB | ~44GB | ~35GB | 48GB+ (e.g., 2×24GB) |

Figures are for weights only; add 1–2GB of overhead plus the KV cache.
How Does Quantization Reduce VRAM Requirements?
Quantization reduces the number of bits needed to represent each model parameter.
- FP32 (32-bit float): Full precision. 1 parameter = 4 bytes. No loss. Slowest.
- Q8 (8-bit): 1 parameter = 1 byte. Near-lossless (typically under 1% quality loss). 75% VRAM savings.
- Q5 (5-bit): 1 parameter = 0.625 bytes. ~1–2% quality loss. ~84% VRAM savings.
- Q4 (4-bit): 1 parameter = 0.5 bytes. Small but occasionally noticeable quality loss. 87.5% VRAM savings.
For most users, Q4 is the sweet spot: a minor quality loss for an 87.5% smaller VRAM footprint.
As of April 2026, Q4 is standard. Q5 and Q8 are available if you have extra VRAM and want marginal quality gains.
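The bytes-per-parameter and savings figures above follow directly from the bit widths; a quick sketch to verify them:

```python
# Bytes per parameter and VRAM savings vs. FP32, derived from bit width.
LEVELS = {"FP32": 32, "Q8": 8, "Q5": 5, "Q4": 4}

for name, bits in LEVELS.items():
    bytes_per_param = bits / 8   # e.g. Q5 -> 0.625 bytes
    savings = 1 - bits / 32      # fraction saved vs. 32-bit baseline
    print(f"{name}: {bytes_per_param} bytes/param, {savings:.1%} saved")
```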
What About Batch Size and Multi-User Inference?
Batch size affects throughput (tokens per second), not single-inference latency.
A single user prompting "What is 2+2?" uses the same VRAM whether batch size is 1 or 32.
Batch size = 32 means processing 32 prompts in parallel. The weights are loaded once and shared; what grows is the KV cache and activations for each concurrent request.
For single-user (typical local LLM usage): batch size = 1. VRAM is model size + 1–2GB overhead.
For a multi-user server: the weights are shared across the batch, so VRAM grows with the per-request KV cache, not with full copies of the model. A 70B Q4 model (~35GB of weights) at batch=4 with long contexts might need ~45–50GB total.
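To see why batch size scales the KV cache rather than the weights, here is a rough KV-cache estimator. The architecture numbers used in the example (80 layers, 8 KV heads via grouped-query attention, head dimension 128) are assumptions loosely modeled on 70B-class models, not values from this article:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, batch_size: int,
                bytes_per_value: int = 2) -> float:
    """K and V tensors (hence the factor of 2) per layer, each shaped
    [batch, kv_heads, context, head_dim], stored in FP16 (2 bytes each)."""
    values = 2 * n_layers * n_kv_heads * head_dim * context_len * batch_size
    return values * bytes_per_value / 1e9

# 8k context: the weights stay the same, only the cache grows with batch.
print(round(kv_cache_gb(80, 8, 128, 8192, 1), 1))  # ~2.7 GB
print(round(kv_cache_gb(80, 8, 128, 8192, 4), 1))  # ~10.7 GB
```

Note that the cache also scales linearly with context length, which is why long-context sessions need extra headroom.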
Do You Need More VRAM Than the Model Size?
Yes. Beyond the model weights, add:
- KV cache (key-value cache for context): ~5–10% extra VRAM.
- Optimizer state (if fine-tuning): 2–4× model size (only relevant for training, not inference).
- System overhead (OS, drivers, Ollama/LM Studio runtime): ~1–2GB.
Rule: a 70B Q4 model (~35GB of weights) + KV cache (~2–3GB) + system (~2GB) ≈ 40GB allocated.
Always buy GPUs with at least 1–2GB of headroom above theoretical minimums.
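Putting the pieces together, the rule above can be expressed as a small estimator. The default overhead figures are this article's rule-of-thumb estimates, not measured values:

```python
def total_vram_gb(params_billions: float, bits_per_param: float,
                  kv_cache_gb: float = 2.0, system_gb: float = 2.0) -> float:
    """Rule-of-thumb total: weights + KV cache + OS/runtime overhead (GB)."""
    weights_gb = params_billions * bits_per_param / 8  # 1e9 params, bits/8 bytes each
    return weights_gb + kv_cache_gb + system_gb

print(total_vram_gb(70, 4))  # 35 + 2 + 2 = 39.0 GB -> plan for a 48GB setup
print(total_vram_gb(7, 4))   # 3.5 + 2 + 2 = 7.5 GB -> an 8GB card works
```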
Common VRAM Misconceptions
- More VRAM = faster inference. False. Once the model fits, extra VRAM doesn't add speed; memory bandwidth (GB/s) does, and that's fixed per GPU.
- Batch size = sequential token limit. False. Batch size = parallel requests. Single inference uses batch=1 regardless of VRAM size.
- 24GB is enough for any 70B model. False: Q4 needs ~40GB and Q8 ~70GB+. 24GB alone only works with very aggressive (2–3-bit) quantization or CPU offloading. It depends on quantization.
FAQ
Can I run Mistral 7B on a 6GB GPU?
Yes at Q4 with a short context, since the weights are only ~4GB, but it's tight. 8GB gives comfortable headroom; on 6GB you'll hit OOM errors at longer contexts.
How much VRAM do I need for fine-tuning a 7B model?
For LoRA: 12–16GB (QLoRA can go lower). Full fine-tuning: 60GB+, because you need gradients and optimizer state (2–4× model VRAM) on top of the weights, not just inference memory.
Is 12GB enough for Llama 3 13B?
At Q4 or Q5, yes, with room to spare (~6.5–8GB of weights). Q8 is tight at 12GB; 16GB is comfortable.
Do I need 24GB for a 70B model?
No — you need more. Q4 needs ~40GB; Q5 and Q8 need 48GB+ and ~70GB+ respectively. 24GB alone only works with very aggressive quantization or CPU offloading.
Does increasing batch size reduce VRAM for single inference?
No. Single inference always uses batch=1 VRAM. Batch size only helps throughput (multi-user scenarios).
What's the best quantization for accuracy?
Q8 has nearly imperceptible loss, Q5 slight (~1–2%), Q4 a bit more. For most users, Q4 remains the best size-to-quality trade-off.
Can I offload some VRAM to CPU RAM?
Yes, via partial GPU offload: splitting the model's layers between GPU VRAM and CPU RAM. llama.cpp (via its `--n-gpu-layers` option) and Ollama support this. Performance drops 30–50% but it works.
Sources
- NVIDIA CUDA memory architecture and shared memory model documentation
- Ollama and LM Studio official documentation: model VRAM requirements and quantization specs
- llama.cpp project GitHub: quantization levels (Q4, Q5, Q8) and memory calculations