Should I buy a GPU exactly at the calculated VRAM size?

No. Always buy 25% more VRAM than calculated. If you calculate an 18 GB need, buy a 24 GB GPU (18 × 1.25 = 22.5 GB, and 24 GB is the next common size up). This safety margin accounts for context growth, batching, and system processes. Exact-fit purchases leave no headroom and cause out-of-memory errors.

What is the best GPU for 13B models in 2026?

RTX 4070 Ti (12 GB) for single-user chat at Q4–Q5. RTX 4080 (16 GB) if you want Q8 quality or batch processing. M5 Max (36 GB) if using Mac. All three can run 13B-class models at Q4 (6.5 GB of weights).

Can I run a 70B model on RTX 4090?

Not entirely in VRAM. At Q4 the weights alone are 35 GB, so a 24 GB RTX 4090 needs CPU offloading (slow, 3–5 tok/sec), which is impractical. Better approach: use a 32B model at Q5 (20 GB) on the RTX 4090 for fast, quality responses. An RTX 5090 (32 GB) still cannot hold 70B at Q4 (35 GB) without offloading; 70B at Q3 (about 26 GB) fits.

VRAM Calculator 2026: 7B/13B/70B LLM GPU Requirements (Q4, Q5, Q8)

Last updated: April 2026·10 min read·By Hans Kuepper · Founder of PromptQuorum, multi-model AI dispatch tool · PromptQuorum

This guide explains how to estimate VRAM requirements for any model and hardware combination. The formula is simple: (Model Size in Billions of Parameters × Quantization Bits) ÷ 8 = VRAM Needed in GB.

Interactive VRAM calculator for local LLMs. Enter model size, quantization, context length, and batch size to calculate exact GPU VRAM needs. Works for 1B–405B models at FP16, Q8, Q5, Q4 quantization. Updated April 2026 with RTX 4090, 4080, 3060 fit analysis and overhead calculations.

Slide Deck: VRAM Calculator 2026: 7B/13B/70B LLM GPU Requirements (Q4, Q5, Q8)

The slide deck below covers: VRAM formula (Model Billions × Quantization Bits) ÷ 8, quantization levels Q2–FP16 with quality trade-offs, quick reference table (3B–70B models), real-world GPU scenarios (RTX 4090, 4080, M5 Max), and regional compliance (EU GDPR, Japan APPI, China Data Security Law). Download the PDF as a VRAM calculator reference card.

Key Takeaways

VRAM = (Model Size × Quantization Bits) ÷ 8
FP16 = 16 bits, Q8 = 8, Q5 = 5, Q4 = 4 bits
Example: 13B model at Q4 = (13 × 4) ÷ 8 = 6.5 GB
Always add a 25% buffer as a safety margin for context and system overhead
As of April 2026, this formula is accurate within ±10%

Quick Facts: VRAM Requirements by GPU

RTX 4090 (24 GB): 7B at Q4 (3.5 GB), 13B at Q5 (8.1 GB), Llama 3.3 70B at Q4 (35 GB) only with CPU offloading
RTX 4080 (16 GB): 7B at Q4 (3.5 GB), 13B at Q5 (8.1 GB); 32B at Q4 (16 GB of weights) fills the whole card and leaves no room for context
RTX 4070 Ti (12 GB): 7B at Q4 (3.5 GB), 13B at Q5 (8.1 GB with tight headroom)
M5 Max Mac (36 GB unified): 13B at FP16 (26 GB); 70B only at Q3 or below, since Q4 (35 GB) leaves no room for context
Rule of thumb: Always budget 25–40% extra VRAM for context, batching, and system overhead beyond the formula result

In One Sentence

VRAM required (GB) equals model parameters in billions multiplied by quantization bits (16 for FP16, 8 for Q8, 4 for Q4, etc.), divided by 8.

In Plain Terms

Think of VRAM like bookshelf space. Bigger books (models with more parameters like 70B) take more shelf space. Smaller books (Q4 quantization) take less space than larger ones (FP16). The formula tells you exactly how many "shelves" (GB) you need. Always leave extra empty shelves for conversations, multiple requests at once, and system software.

What Is the VRAM Formula?

The formula for VRAM requirement is deceptively simple:

💡 Pro Tip: This formula calculates model weights only. Real VRAM usage is 25–40% higher due to context, batching, and system overhead. Always add a safety margin.

VRAM (GB) = (Model Size in Billions × Quantization Bits) ÷ 8

Example:
- 7B model at 4-bit quantization
- (7 × 4) ÷ 8 = 3.5 GB

- 13B model at 5-bit quantization
- (13 × 5) ÷ 8 = 8.125 GB

- 70B model at 8-bit quantization
- (70 × 8) ÷ 8 = 70 GB
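
The arithmetic above can be sketched as a small Python helper. The function name `weights_vram_gb` is made up for illustration and is not part of any library; like the formula, it covers model weights only.

```python
def weights_vram_gb(params_billions: float, bits: float) -> float:
    """Weights-only VRAM estimate in GB: (billions of parameters x bits) / 8."""
    if params_billions <= 0 or bits <= 0:
        raise ValueError("model size and bit width must be positive")
    return params_billions * bits / 8


# The three worked examples from the text.
for size, bits in [(7, 4), (13, 5), (70, 8)]:
    print(f"{size}B at {bits}-bit: {weights_vram_gb(size, bits)} GB")
```

Running it reproduces 3.5 GB, 8.125 GB, and 70 GB for the three cases.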

VRAM formula with 3 calculation examples: 7B model at Q4 = 3.5 GB, 13B at Q5 = 8.1 GB, 70B at Q8 = 70 GB. Always add 25–40% buffer for context, batching, and system overhead.

Interactive VRAM Calculator

Use this calculator to compute exact VRAM requirements for any combination of model, quantization, context, and batch size. Select your configuration and see which GPUs fit.

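Inputs: Popular Models, Model Size, Quantization, Context Length, Batch Size, Use Case

Example result for a 6.50 GB base model (for instance, 13B at Q4) with no batch overhead:

Component	VRAM
Base Model	6.50 GB
Context overhead	1.50 GB
Batch overhead	0.00 GB
System overhead	1.00 GB
Total Minimum	9.00 GB
Recommended (with 25% safety margin)	11.25 GB

👉 Look for a GPU with at least 11.25 GB VRAM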

Compatible GPUs

GPU	VRAM	Headroom	Status
RTX 3060	12 GB	0.8 GB	⚠️ Tight
RTX 4070	12 GB	0.8 GB	⚠️ Tight
RTX 4070 Ti	12 GB	0.8 GB	⚠️ Tight
RTX 4080	16 GB	4.8 GB	✅ Fits
RTX 4090	24 GB	12.8 GB	✅ Fits
Mac mini M5	16 GB	4.8 GB	✅ Fits
Mac mini M4	16 GB	4.8 GB	✅ Fits
MacBook Pro	24 GB	12.8 GB	✅ Fits
M3 Max	36 GB	24.8 GB	✅ Fits
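
The breakdown and headroom figures in this example can be reproduced with a few lines of Python. This is a minimal sketch, not the calculator's actual code: the class and function names are hypothetical, and the overhead values are simply the ones shown in the example result.

```python
from dataclasses import dataclass

SAFETY_MARGIN = 0.25  # the 25% margin used in the example result


@dataclass
class VramEstimate:
    base_gb: float     # model weights from the formula
    context_gb: float  # context (KV cache) overhead
    batch_gb: float    # extra memory for concurrent requests
    system_gb: float   # OS and inference engine overhead

    @property
    def total_min_gb(self) -> float:
        return self.base_gb + self.context_gb + self.batch_gb + self.system_gb

    @property
    def recommended_gb(self) -> float:
        return self.total_min_gb * (1 + SAFETY_MARGIN)


def headroom_gb(gpu_vram_gb: float, estimate: VramEstimate) -> float:
    """VRAM left over after the recommended figure (negative means it does not fit)."""
    return gpu_vram_gb - estimate.recommended_gb


example = VramEstimate(base_gb=6.5, context_gb=1.5, batch_gb=0.0, system_gb=1.0)
print(f"Total minimum: {example.total_min_gb:.2f} GB")
print(f"Recommended:   {example.recommended_gb:.2f} GB")
for name, vram in [("RTX 3060", 12), ("RTX 4080", 16), ("RTX 4090", 24)]:
    print(f"{name}: {headroom_gb(vram, example):.2f} GB headroom")
```

The headroom values come out as 0.75, 4.75, and 12.75 GB, which the list above shows rounded to one decimal place.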

💡 Pro Tips:

Always use the "with safety margin" figure when buying a GPU
Q4 gives 90-95% quality with a 75% size reduction relative to FP16. Q5 is better if you have room
Context overhead grows with conversation length. Budget 1-3 GB for typical usage
Batch size matters for multi-user APIs. Single-user chat can ignore batch overhead

What Do Quantization Levels Mean?

🔍 Key Insight: Quantization trades file size for quality. Q5 is the sweet spot (95% quality, 68% smaller). Q4 is acceptable for most users. Q3 and below are only for edge devices or when VRAM is critically constrained.

Quantization	Size Reduction	Quality	Speed	Use Case
FP16 (16-bit)	None (baseline)	100% (perfect)	Baseline	Research, fine-tuning
Q8 (8-bit)	50%	99% (imperceptible)	Baseline	Production, local servers
Q6 (6-bit)	62.5%	98% (negligible)	Baseline	Balanced use
Q5 (5-bit)	68.75%	95% (minor loss)	Baseline	Good compression, consumer
Q4 (4-bit)	75%	90-95% (acceptable)	Baseline	Maximum compression
Q3 (3-bit)	81.25%	80-85% (noticeable loss)	Faster	Extreme compression, CPU
Q2 (2-bit)	87.5%	70% (visible loss)	Fastest	Tiny models, edge devices

Quantization levels comparison: FP16 (100% quality), Q8 (99%), Q5 (95%, recommended), Q4 (90–95%), Q3 (80–85%), Q2 (70%). Q5 reduces a 7B model from 14 GB to 4.4 GB with only 5% quality loss.
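
The "Size Reduction" column follows directly from the bit widths: each level is compared against the 16-bit baseline. A short sketch, assuming the nominal bit counts from the table (the helper name is hypothetical):

```python
FP16_BITS = 16  # baseline precision in the table


def size_reduction_pct(bits: float) -> float:
    """Percent size reduction relative to FP16 for a given nominal bit width."""
    return (1 - bits / FP16_BITS) * 100


for bits in (8, 6, 5, 4, 3, 2):
    print(f"Q{bits}: {size_reduction_pct(bits):.2f}% smaller than FP16")

# A 7B model: 14 GB at FP16 versus Q5.
fp16_gb = 7 * 16 / 8
q5_gb = 7 * 5 / 8
print(f"7B: {fp16_gb} GB at FP16, {q5_gb} GB at Q5")
```

The Q5 result is 4.375 GB, which the table rounds to 4.4 GB.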

Quick Reference Table: VRAM by Model and Quantization

Model	FP16	Q8	Q5	Q4
3B	6 GB	3 GB	1.9 GB	1.5 GB
7B	14 GB	7 GB	4.4 GB	3.5 GB
13B	26 GB	13 GB	8.1 GB	6.5 GB
32B	64 GB	32 GB	20 GB	16 GB
70B	140 GB	70 GB	43.75 GB	35 GB

VRAM quick reference matrix: 3B to 70B models at FP16, Q8, Q5, and Q4 quantization. Green = fits in 12 GB GPU. Amber = needs 16–24 GB. Red = requires 40+ GB or multi-GPU.

Real-World Examples

Practical VRAM calculations for common scenarios:

⚠️ Warning: These calculations are for model weights only. Add 25–40% for context, batch processing, and system overhead. Example: 13B Q5 = 8.1 GB model + 2–3 GB overhead = 10–11 GB actual.

RTX 4070 Ti (12 GB): 7B at Q4 = 3.5 GB ✓ (plenty of room). 13B at Q5 = 8.1 GB ✓ (tight: with 2–3 GB of overhead it reaches 10–11 GB, so keep context short and avoid batching).
RTX 4090 (24 GB): Llama 3.3 70B at Q5 = 43.75 GB ✗ (too large). Llama 3.3 70B at Q4 = 35 GB ✗ (still too large). Llama 3.3 70B at Q4 with offloading = works (slow, 3–5 tok/sec).
M5 Max Mac (36 GB): 13B at FP16 = 26 GB ✓ (works). Llama 3.3 70B at Q4 = 35 GB ✗ (no room left for context or the OS); it fits only at Q3 (26.25 GB) or Q2 (17.5 GB), with noticeable to visible quality loss.

Real-world GPU scenarios: RTX 4090 (24 GB), RTX 4080 (16 GB), RTX 4070 Ti (12 GB), M5 Max Mac (36 GB), and RTX 3060 (12 GB) — what Llama 3.3 models each can run at various quantization levels.
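
The pass/fail verdicts above compare weight size against card size. A rough sketch of that check, which also reports how much of the model would have to be offloaded to system RAM, is shown below. The function names are hypothetical, and the spill figure is a lower bound because context and system overhead also occupy VRAM.

```python
def weights_gb(params_billions: float, bits: float) -> float:
    """Weights-only size in GB from the article's formula."""
    return params_billions * bits / 8


def spill_gb(params_billions: float, bits: float, gpu_vram_gb: float) -> float:
    """GB of weights that cannot fit on the GPU (0.0 if everything fits)."""
    return max(0.0, weights_gb(params_billions, bits) - gpu_vram_gb)


# (GPU name, VRAM in GB, model size in billions, quantization bits)
scenarios = [
    ("RTX 4070 Ti", 12, 13, 5),
    ("RTX 4090", 24, 70, 5),
    ("RTX 4090", 24, 70, 4),
]

for gpu, vram, size, bits in scenarios:
    spill = spill_gb(size, bits, vram)
    verdict = "fits" if spill == 0 else f"needs at least {spill:g} GB offloaded"
    print(f"{gpu}: {size}B at Q{bits} = {weights_gb(size, bits):g} GB -> {verdict}")
```

For the 70B Q4 case on a 24 GB card, at least 11 GB of weights end up in system RAM, which is why that configuration is slow.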

What Hidden VRAM Overhead Should You Account For?

The formula calculates model weights only. Your actual VRAM usage will be higher due to several factors. Budget an additional 25–40% beyond the calculated amount.

Context window (key-value cache) stores conversation history during inference. A 4k-token context uses approximately 2–3 GB for a 7B model.

📌 Key Point: Batch processing increases VRAM usage linearly. Each additional concurrent prompt (when processing multiple requests simultaneously) uses 500 MB–2 GB of extra memory. The model weights are loaded once and shared, so if you run batch=4, multiply the per-request memory by 4 (2–8 GB) rather than the whole model size.

System overhead from the operating system and inference engine framework (Ollama, vLLM, llama.cpp) reserves 500 MB–1 GB. Always maintain a safety margin when choosing a GPU.

Hidden VRAM overhead breakdown: context window (2–3 GB for 4k tokens), batch processing (per-request memory ×4 for batch=4), system overhead (500 MB–1 GB), and 25–40% safety margin total.
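
The context figure can be sanity-checked with the standard key-value cache size calculation: two tensors (keys and values) per layer, per token. The sketch below assumes a Llama-2-7B-style configuration (32 layers, 32 KV heads, head dimension 128) and a 16-bit cache; those numbers are assumptions for illustration, and models that use grouped-query attention have fewer KV heads and a proportionally smaller cache.

```python
def kv_cache_gb(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # 16-bit cache entries (assumption)
    batch_size: int = 1,
) -> float:
    """Approximate KV cache size in GB for a decoder-only Transformer."""
    # Factor 2: one key tensor and one value tensor per layer.
    total_bytes = (
        2 * n_layers * n_kv_heads * head_dim
        * context_tokens * bytes_per_value * batch_size
    )
    return total_bytes / 1e9


# Assumed 7B-class configuration: 32 layers, 32 KV heads, head_dim 128.
single = kv_cache_gb(32, 32, 128, context_tokens=4096)
batched = kv_cache_gb(32, 32, 128, context_tokens=4096, batch_size=4)
print(f"4k context, one request:   {single:.2f} GB")
print(f"4k context, four requests: {batched:.2f} GB")
```

Under these assumptions a single 4k-token context costs about 2.15 GB, in line with the 2–3 GB estimate, and the cost scales linearly with batch size while the weights stay constant.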

Which Local LLM Fits Your GPU? 2026 Guide

Use the interactive calculator above to find your exact fit. Below are common GPU scenarios and recommended models.

RTX 3060 (12 GB): Best model: Qwen3 7B Q5 (4.4 GB) ✓. Alternative: Llama 3.1 8B Q4 (4 GB) ✓. Not possible: 32B+ models.
RTX 4070 (12 GB): Best model: Qwen3 13B Q4 (6.5 GB) ✓. With headroom: Llama 3.1 8B Q5 (5 GB) ✓. Not possible: 32B models.
RTX 4070 Ti (12 GB): Best model: Qwen3 13B Q5 (8.1 GB) ✓ tight. With headroom: any 13B at Q4 (6.5 GB) ✓. Not ideal: Batch processing.
RTX 4080 (16 GB): Recommended: 13B at Q8 (13 GB) ✓. Tight: Mistral Small 3.1 24B at Q4 (12 GB) ✓; at Q5 (15 GB) it leaves only 1 GB for context. Not practical: Qwen3 32B Q4 (16 GB of weights fills the card and needs offloading).
RTX 4090 (24 GB): Best model: Qwen3 32B Q5 (20 GB) ✓. With offload: Llama 3.3 70B Q4 (35 GB, needs offloading). Comfortable: any 32B at Q4 (16 GB). Does not fit: 32B at Q8 (32 GB).
RTX 5090 (32 GB): Best model: Qwen3 72B Q3 (27 GB) ✓. With offload: Llama 3.3 70B Q4 (35 GB exceeds 32 GB). Does not fit: 70B at Q5 or higher (43.75 GB or more). Comfortable: any 32B at Q5 (20 GB).
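
One way to turn this list into a rule is to pick the highest-precision quantization whose weights, plus the 25% margin, still fit on the card. The sketch below does that for the four levels used in the quick reference table. The function name and the choice to apply the margin to weights are assumptions for illustration, so the result is stricter than a weights-only comparison.

```python
# Quantization levels from the quick reference table, highest precision first.
QUANT_LEVELS = [("FP16", 16), ("Q8", 8), ("Q5", 5), ("Q4", 4)]
MARGIN_FACTOR = 1.25  # weights plus the article's 25% safety margin


def best_quant(params_billions: float, gpu_vram_gb: float):
    """Return the highest-precision level that fits with margin, or None."""
    for name, bits in QUANT_LEVELS:
        needed_gb = params_billions * bits / 8 * MARGIN_FACTOR
        if needed_gb <= gpu_vram_gb:
            return name
    return None


print(best_quant(7, 12))   # 7B on a 12 GB card
print(best_quant(13, 12))  # 13B on a 12 GB card
print(best_quant(70, 24))  # 70B on a 24 GB card
```

With the margin applied, a 13B model on a 12 GB card lands on Q5, and a 70B model on a 24 GB card returns None, meaning it needs offloading. A 32B model on a 24 GB card lands on Q4 under this rule, because Q5 with margin needs 25 GB.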

How Accurate Is the Formula?

The formula is accurate within ±10% for most cases. Real-world VRAM usage varies based on implementation, model architecture, and inference engine optimizations.

Sources of variation include: different quantization formats (GGUF vs. GPTQ vs. AWQ), model architecture (dense Transformer vs. MoE), and inference engine-specific optimizations (vLLM, llama.cpp, Ollama). Many formats also store scaling factors alongside the weights, so the effective bits per weight can be slightly above the nominal figure.

As of April 2026, treat the formula as a baseline for model weights only, not as a conservative total. Always add a 25% safety margin when purchasing GPUs to account for context overhead, batching, and system processes.

VRAM formula accuracy ±10%: variation caused by quantization format (GGUF vs GPTQ vs AWQ), model architecture (Transformer vs MoE), and inference engine (vLLM vs llama.cpp vs Ollama).

Common Mistakes in VRAM Calculation

Forgetting the context overhead. A 7B model at Q4 is 3.5 GB, but with 4k context, it needs 5-6 GB total.
Using model size from HuggingFace without considering quantization. 70B means 70 billion parameters, not 70 GB VRAM.
Not accounting for system overhead. Models never get the full GPU VRAM. Reserve 1-2 GB for the OS and inference engine.
Buying a GPU exactly at the calculated size. Always buy 25% more. A calculated 18 GB need means get a 24 GB GPU.

4 common VRAM mistakes: forgetting context overhead (adds 1.5–3 GB), confusing 70B parameters with 70 GB VRAM, ignoring 1–2 GB system overhead, and buying a GPU exactly at the calculated size without 25% margin.
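
The last mistake comes down to a simple lookup: apply the 25% margin, then round up to the next card size. A minimal sketch, assuming a hypothetical list of VRAM sizes taken from the NVIDIA cards named in this guide:

```python
import bisect

# VRAM sizes (GB) of the NVIDIA cards mentioned in this guide; an assumption,
# not an exhaustive list of what is on the market.
GPU_SIZES_GB = [12, 16, 24, 32]


def smallest_gpu_gb(calculated_gb: float, margin: float = 0.25):
    """Smallest listed VRAM size that covers the calculated need plus margin."""
    target = calculated_gb * (1 + margin)
    index = bisect.bisect_left(GPU_SIZES_GB, target)
    return GPU_SIZES_GB[index] if index < len(GPU_SIZES_GB) else None


print(smallest_gpu_gb(18))  # 18 GB need -> 22.5 GB target
print(smallest_gpu_gb(9))   # 9 GB need -> 11.25 GB target
print(smallest_gpu_gb(35))  # 35 GB need -> 43.75 GB target
```

An 18 GB need maps to a 24 GB card, a 9 GB need to a 12 GB card, and a 35 GB need returns None because no single card in the list is large enough.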

Regional Deployment Considerations

European Union (GDPR): Local inference (on-premises) helps with data residency under GDPR, because running models on your own GPU keeps user data on hardware you control. This VRAM calculator helps you size hardware for privacy-first deployments.

Japan (APPI): The Act on Protection of Personal Information (APPI) requires careful data handling. On-device LLM inference reduces data transfer and processing outside Japan. Use this calculator to size systems for Japanese enterprise deployments.

China (Data Security Law): China's 2021 Data Security Law imposes data localization requirements for certain categories of data. Local LLM inference on domestic servers (Alibaba Cloud, Tencent Cloud) keeps processing within Chinese borders. This formula applies to sizing those deployments with Chinese-optimized models like Qwen3.

In all regions, local inference provides stronger data privacy guarantees than cloud APIs. This VRAM calculator is essential for designing compliant, privacy-preserving AI systems.

FAQ: VRAM and GPU Requirements

Does the formula work for all model types?

Yes. The formula (Model Billions × Quantization Bits) ÷ 8 applies to all Transformer-based models with downloadable weights (Llama, Qwen, Mistral, etc.). Non-Transformer architectures (RNNs, etc.) are rare and may require adjustment.

What quantization should I use?

For most use cases: Q5 offers the best balance (95% quality, 68% size reduction). For consumer GPUs: Q4 is standard (90-95% quality, 75% reduction). For production: Q8 if VRAM allows (99% quality). Avoid Q3 and below unless you have no choice.

How much system RAM do I need?

Minimum 16 GB for offloading. If using VRAM offloading (CPU spillover), system RAM becomes the fallback. For batch processing, add 8–16 GB system RAM beyond model offload requirements. For single-user chat, 16 GB is sufficient.

Does batch size affect VRAM calculation?

Yes. The formula calculates single-request VRAM. Batch size adds additional VRAM linearly: each concurrent request adds ~500 MB–2 GB depending on context length. If running batch=4, add 2–8 GB to the calculated amount.

Can I run a 70B model on a 12 GB GPU?

Only with extreme quantization (Q2, about 70% of original quality) and CPU offloading (very slow, 1–3 tokens/sec), because even the Q2 weights are 17.5 GB. Not practical. Better option: use a 13B model at Q4 (6.5 GB, fits the same card, much faster and better quality).

What if my actual VRAM usage is lower than calculated?

The formula alone covers weights only, but the recommended figure with overhead and the 25% safety margin is deliberately conservative. Lower actual usage means more headroom for batch processing or longer contexts. Use nvidia-smi to measure real usage, then benchmark your model to confirm performance.
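
To compare the estimate with reality, the nvidia-smi query interface can report used and total memory per GPU in MiB. The wrapper below is a sketch: the function names are hypothetical, it returns None when nvidia-smi is not installed, and it assumes every field is numeric (some systems report "[N/A]", which this parser does not handle).

```python
import shutil
import subprocess


def parse_memory_csv(output: str) -> list[tuple[int, int]]:
    """Parse 'used, total' MiB pairs from nvidia-smi CSV output, one line per GPU."""
    pairs = []
    for line in output.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        pairs.append((used, total))
    return pairs


def query_gpu_memory():
    """Return [(used_mib, total_mib), ...] or None if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_csv(result.stdout)


# Parsing a sample line in the same format (values are made up for illustration).
sample = "9216, 24564\n"
for used_mib, total_mib in parse_memory_csv(sample):
    print(f"{used_mib / 1024:.1f} GiB used of {total_mib / 1024:.1f} GiB")
```

Call `query_gpu_memory()` while the model is loaded and answering a prompt at your usual context length, since usage grows as the conversation gets longer.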

Your VRAM limits model size, but model size isn't the only limit on output quality. Larger context windows enable better responses: context windows explained covers how to work within constraints.

A Note on Third-Party Facts

This article references third-party AI models, benchmarks, prices, and licenses. The AI landscape changes rapidly. Benchmark scores, license terms, model names, and API prices can shift between the time of writing and the time you read this. Before making deployment or compliance decisions based on this article, verify current figures on each provider’s official source: Hugging Face model cards for licenses and benchmarks, provider websites for API pricing, and EUR-Lex for current GDPR and EU AI Act text. This article reflects publicly available information as of May 2026.
