VRAM Calculator 2026: 7B/13B/70B LLM GPU Requirements (Q4, Q5, Q8)

10 min read · By Hans Kuepper · Founder of PromptQuorum, multi-model AI dispatch tool

This guide explains how to estimate VRAM requirements for any model and hardware combination. The formula is simple: (Model Parameters in Billions × Quantization Bits) ÷ 8 = VRAM Needed in GB.

Interactive VRAM calculator for local LLMs. Enter model size, quantization, context length, and batch size to calculate exact GPU VRAM needs. Works for 1B–405B models at FP16, Q8, Q5, Q4 quantization. Updated April 2026 with RTX 4090, 4080, 3060 fit analysis and overhead calculations.

Key Takeaways

  • VRAM = (Model Size × Quantization Bits) ÷ 8
  • FP16 = 16 bits, Q8 = 8, Q5 = 5, Q4 = 4 bits
  • Example: 13B model at Q4 = (13 × 4) ÷ 8 = 6.5 GB
  • Always add 25% buffer for context, system overhead, and safe margin
  • As of April 2026, this formula is accurate within ±10%

Quick Facts: VRAM Requirements by GPU

  • RTX 4090 (24 GB): 7B model at Q4 (3.5 GB), 13B at Q5 (8.1 GB), Llama 3.1 70B at Q4 with offloading
  • RTX 4080 (16 GB): 7B model at Q4 (3.5 GB), 13B at Q5 (8.1 GB); 32B at Q4 (16 GB) fills the card before any overhead, so it needs partial offloading
  • RTX 4070 Ti (12 GB): 7B model at Q4 (3.5 GB), 13B at Q5 (8.1 GB with tight headroom)
  • M5 Max Mac (36 GB unified): 13B model at FP16 (26 GB), 70B not possible without extreme quantization
  • Rule of thumb: Always budget 25–40% extra VRAM for context, batching, and system overhead beyond the formula result

In One Sentence

VRAM required (GB) equals model parameters in billions multiplied by quantization bits (16 for FP16, 8 for Q8, 4 for Q4, etc.), divided by 8.

In Plain Terms

Think of VRAM like bookshelf space. Bigger books (models with more parameters like 70B) take more shelf space. Smaller books (Q4 quantization) take less space than larger ones (FP16). The formula tells you exactly how many "shelves" (GB) you need. Always leave extra empty shelves for conversations, multiple requests at once, and system software.

What Is the VRAM Formula?

The formula for VRAM requirement is deceptively simple:

VRAM (GB) = (Model Size in Billions × Quantization Bits) ÷ 8

Examples:
- 7B model at 4-bit quantization: (7 × 4) ÷ 8 = 3.5 GB
- 13B model at 5-bit quantization: (13 × 5) ÷ 8 = 8.125 GB
- 70B model at 8-bit quantization: (70 × 8) ÷ 8 = 70 GB

💡 Pro Tip: This formula calculates model weights only. Real VRAM usage is 25–40% higher due to context, batching, and system overhead. Always add a safety margin.
VRAM formula with 3 calculation examples: 7B model at Q4 = 3.5 GB, 13B at Q5 = 8.1 GB, 70B at Q8 = 70 GB. Always add 25–40% buffer for context, batching, and system overhead.
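
The formula translates directly into code. The sketch below is a minimal illustration, not part of the calculator itself; the function names `weights_vram_gb` and `with_margin` are hypothetical, and the 25% default margin is the article's lower bound.

```python
def weights_vram_gb(params_billions: float, bits: float) -> float:
    """Weights-only VRAM in GB: (parameters in billions x bits per weight) / 8."""
    if params_billions <= 0 or bits <= 0:
        raise ValueError("params_billions and bits must be positive")
    return params_billions * bits / 8


def with_margin(vram_gb: float, margin: float = 0.25) -> float:
    """Add a safety margin on top of an estimate (0.25 = 25%, the article's minimum)."""
    return vram_gb * (1 + margin)


if __name__ == "__main__":
    # The three worked examples from this section.
    for params, bits in [(7, 4), (13, 5), (70, 8)]:
        base = weights_vram_gb(params, bits)
        print(f"{params}B at {bits}-bit: {base:.3f} GB weights, "
              f"{with_margin(base):.2f} GB with 25% margin")
```

Passing `margin=0.40` gives the upper end of the 25–40% range for long contexts or multi-user serving.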

Interactive VRAM Calculator

Use this calculator to compute exact VRAM requirements for any combination of model, quantization, context, and batch size. Select your configuration and see which GPUs fit.

Example result:

  • Base model: 6.50 GB
  • Context overhead: 1.50 GB
  • Batch overhead: 0.00 GB
  • System overhead: 1.00 GB
  • Total minimum: 9.00 GB
  • Recommended (with 25% safety margin): 11.25 GB

👉 Look for a GPU with at least 11.25 GB VRAM.

Compatible GPUs:

| GPU | VRAM | Headroom | Status |
| --- | --- | --- | --- |
| RTX 3060 | 12 GB | 0.8 GB | ⚠️ Tight |
| RTX 4070 | 12 GB | 0.8 GB | ⚠️ Tight |
| RTX 4070 Ti | 12 GB | 0.8 GB | ⚠️ Tight |
| RTX 4080 | 16 GB | 4.8 GB | ✅ Fits |
| RTX 4090 | 24 GB | 12.8 GB | ✅ Fits |
| Mac mini M5 | 16 GB | 4.8 GB | ✅ Fits |
| Mac mini M4 | 16 GB | 4.8 GB | ✅ Fits |
| MacBook Pro | 24 GB | 12.8 GB | ✅ Fits |
| M3 Max | 36 GB | 24.8 GB | ✅ Fits |

💡 Pro Tips:

  • Always use the "with safety margin" figure when buying a GPU
  • Q4 gives 90–95% quality at 25% of the FP16 size (a 75% reduction). Q5 is better if you have room
  • Context overhead grows with conversation length. Budget 1–3 GB for typical usage
  • Batch size matters for multi-user APIs. Single-user chat can ignore batch overhead
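
The calculator's arithmetic can be reproduced in a few lines. This is a sketch under stated assumptions, not the site's actual implementation: the `Estimate` class and `gpu_fit` function are hypothetical names, and the 1 GB threshold for labelling a fit "tight" is a guess that happens to match the statuses shown above.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    base_gb: float      # model weights from the formula
    context_gb: float   # KV cache for the chosen context length
    batch_gb: float     # extra memory for concurrent requests
    system_gb: float    # OS and inference-engine overhead
    margin: float = 0.25  # the article's 25% safety margin

    @property
    def total_minimum_gb(self) -> float:
        return self.base_gb + self.context_gb + self.batch_gb + self.system_gb

    @property
    def recommended_gb(self) -> float:
        return self.total_minimum_gb * (1 + self.margin)


def gpu_fit(estimate: Estimate, gpu_vram_gb: float,
            tight_below_gb: float = 1.0) -> tuple[float, str]:
    """Return (headroom in GB, label); tight_below_gb is an assumed threshold."""
    headroom = gpu_vram_gb - estimate.recommended_gb
    if headroom < 0:
        return headroom, "does not fit"
    if headroom < tight_below_gb:
        return headroom, "tight"
    return headroom, "fits"


# The configuration shown in the calculator output above.
example = Estimate(base_gb=6.5, context_gb=1.5, batch_gb=0.0, system_gb=1.0)
for vram in (12, 16, 24, 36):
    headroom, label = gpu_fit(example, vram)
    print(f"{vram} GB GPU: {headroom:.2f} GB headroom ({label})")
```

Headroom is measured against the recommended figure rather than the total minimum, which is why a 12 GB card shows under 1 GB of slack even though the minimum is only 9 GB.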

What Do Quantization Levels Mean?

🔍 Key Insight: Quantization trades quality for a smaller file. Q5 is the sweet spot (95% quality, about 69% smaller than FP16). Q4 is acceptable for most users. Q3 and below are only for edge devices or when VRAM is critically constrained.

| Quantization | Size Reduction | Quality | Speed | Use Case |
| --- | --- | --- | --- | --- |
| FP16 (16-bit) | None (baseline) | 100% (perfect) | Baseline | Research, fine-tuning |
| Q8 (8-bit) | 50% | 99% (imperceptible) | Baseline | Production, local servers |
| Q6 (6-bit) | 62.5% | 98% (negligible) | Baseline | Balanced use |
| Q5 (5-bit) | 68.75% | 95% (minor loss) | Baseline | Good compression, consumer |
| Q4 (4-bit) | 75% | 90–95% (acceptable) | Baseline | Maximum compression |
| Q3 (3-bit) | 81% | 80–85% (noticeable loss) | Faster | Extreme compression, CPU |
| Q2 (2-bit) | 87.5% | 70% (visible loss) | Fastest | Tiny models, edge devices |
Quantization levels comparison: FP16 (100% quality), Q8 (99%), Q5 (95%, recommended), Q4 (90–95%), Q3 (80–85%), Q2 (70%). Q5 reduces a 7B model from 14 GB to 4.4 GB with only 5% quality loss.
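
The size-reduction column follows from the bit widths alone: a weight stored in b bits takes b/16 of its FP16 size. The short sketch below recomputes the column; `size_reduction_vs_fp16` is a hypothetical helper name. It uses nominal bit widths, and real quantization formats also store scaling metadata, so effective bits per weight can be somewhat higher.

```python
def size_reduction_vs_fp16(bits: float) -> float:
    """Fraction of the FP16 size saved when each weight is stored in `bits` bits."""
    return 1 - bits / 16


for name, bits in [("Q8", 8), ("Q6", 6), ("Q5", 5), ("Q4", 4), ("Q3", 3), ("Q2", 2)]:
    print(f"{name}: {size_reduction_vs_fp16(bits):.2%} smaller than FP16")
```

Q3 comes out at 81.25%, which the table rounds to 81%.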

Quick Reference Table: VRAM by Model and Quantization

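| Model | FP16 | Q8 | Q5 | Q4 |
| --- | --- | --- | --- | --- |
| 3B | 6 GB | 3 GB | 1.9 GB | 1.5 GB |
| 7B | 14 GB | 7 GB | 4.4 GB | 3.5 GB |
| 13B | 26 GB | 13 GB | 8.1 GB | 6.5 GB |
| 32B | 64 GB | 32 GB | 20 GB | 16 GB |
| 70B | 140 GB | 70 GB | 43.75 GB | 35 GB |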
VRAM quick reference matrix: 3B to 70B models at FP16, Q8, Q5, and Q4 quantization. Green = fits in 12 GB GPU. Amber = needs 16–24 GB. Red = requires 40+ GB or multi-GPU.

Real-World Examples

Practical VRAM calculations for common scenarios:

⚠️ Warning: These calculations are for model weights only. Add 25–40% for context, batch processing, and system overhead. Example: 13B Q5 = 8.1 GB model + 2–3 GB overhead = 10–11 GB actual.

  • RTX 4070 Ti (12 GB): 7B model at Q4 = 3.5 GB ✓ (plenty of room). 13B model at Q5 = 8.1 GB ✓ (tight; works only with short context and no batching).
  • RTX 4090 (24 GB): Llama 3.1 70B at Q5 = 43.75 GB ✗ (too large). Llama 3.1 70B at Q4 = 35 GB ✗ (still too large). Llama 3.1 70B at Q4 with offloading = works (slow, 3–5 tok/sec).
  • M5 Max Mac (36 GB): 13B model at FP16 = 26 GB ✓ (works). Llama 3.1 70B = not practical (Q4 needs 35 GB before overhead; Q2 fits at 17.5 GB but at roughly 70% quality).
Real-world GPU scenarios: RTX 4090 (24 GB), RTX 4080 (16 GB), RTX 4070 Ti (12 GB), M5 Max Mac (36 GB), and RTX 3060 (12 GB), showing what Llama 3.1 models each can run at various quantization levels.

What Hidden VRAM Overhead Should You Account For?

The formula calculates model weights only. Your actual VRAM usage will be higher due to several factors. Budget an additional 25–40% beyond the calculated amount.

Context window (key-value cache) stores conversation history during inference. A 4k-token context uses approximately 2–3 GB for a 7B model.

📌 Key Point: Batch processing increases VRAM usage linearly. Each additional concurrent prompt (when processing multiple requests simultaneously) uses 500 MB–2 GB of extra memory for its own context. The model weights are loaded once and shared, so for batch=4, multiply the per-request context memory by 4 rather than the whole single-request total.

System overhead from the operating system and inference engine framework (Ollama, vLLM, llama.cpp) reserves 500 MB–1 GB. Always maintain a safety margin when choosing a GPU.

Hidden VRAM overhead breakdown: context window (2–3 GB for 4k tokens), batch processing (×4 for batch=4), system overhead (500 MB–1 GB), and 25–40% safety margin total.
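
The context figure can be estimated from first principles. A standard Transformer caches one key vector and one value vector per layer, per KV head, per token. The sketch below applies that rule; `kv_cache_gb` is a hypothetical helper, and the example architecture (32 layers, 32 KV heads, head dimension 128, 16-bit cache values) is an assumed 7B-class configuration, not a figure from this article. Check the real values in your model's config file.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, batch_size: int = 1,
                bytes_per_value: int = 2) -> float:
    """KV cache size in GiB for `batch_size` sequences of `context_tokens` tokens."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens * batch_size / 1024**3


# Assumed 7B-class model with full multi-head attention, 4k context.
print(kv_cache_gb(32, 32, 128, 4096))                # 2.0
# Same model with grouped-query attention (8 KV heads, assumed).
print(kv_cache_gb(32, 8, 128, 4096))                 # 0.5
# Four concurrent requests: only the cache scales, not the weights.
print(kv_cache_gb(32, 32, 128, 4096, batch_size=4))  # 8.0
```

Models that use grouped-query attention need much less cache for the same context, which is one reason measured usage can fall below the 2–3 GB rule of thumb.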

Which Local LLM Fits Your GPU? 2026 Guide

Use the interactive calculator above to find your exact fit. Below are common GPU scenarios and recommended models.

  • RTX 3060 (12 GB): Best model: Qwen2.5 7B Q5 (4.4 GB) ✓. Alternative: Llama 3.1 8B Q4 (4 GB) ✓. Not possible: 32B+ models.
  • RTX 4070 (12 GB): Best model: Qwen2.5 14B Q4 (7 GB) ✓. With headroom: Llama 3.1 8B Q5 (5 GB) ✓. Not possible: 32B models.
  • RTX 4070 Ti (12 GB): Best model: Qwen2.5 14B Q5 (8.75 GB) ✓ tight. More headroom: Qwen2.5 14B Q4 (7 GB) ✓. Not ideal: Batch processing.
  • RTX 4080 (16 GB): Best model: Mistral Small 3.1 24B Q4 (12 GB) ✓. Tight fit: Mistral Small 3.1 24B Q5 (15 GB), which leaves about 1 GB for context. Needs offloading: Qwen2.5 32B Q4 (16 GB of weights alone).
  • RTX 4090 (24 GB): Best model: Qwen2.5 32B Q5 (20 GB) ✓. With offload: Llama 3.3 70B Q4 (35 GB, needs offloading). Comfortable: Any 32B at Q4 (16 GB). Does not fit: 32B at Q8 (32 GB).
  • RTX 5090 (32 GB): Needs offloading: Llama 3.3 70B Q4 (35 GB exceeds VRAM). Better: Qwen2.5 72B Q3 (27 GB) ✓. Comfortable: Any 32B at Q5 (20 GB).
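
The formula can also be inverted to ask how large a model a given GPU can hold. The sketch below solves (weights + overhead) × (1 + margin) ≤ VRAM for the parameter count. `max_params_billions` is a hypothetical helper, and the 2.5 GB default overhead is an assumption taken from the calculator example (1.5 GB context plus 1.0 GB system), so adjust it for your context length.

```python
def max_params_billions(gpu_vram_gb: float, bits: float,
                        overhead_gb: float = 2.5, margin: float = 0.25) -> float:
    """Largest parameter count (in billions) that fits after overhead and margin."""
    if bits <= 0:
        raise ValueError("bits must be positive")
    # Weights budget: what is left once the margin and fixed overhead are reserved.
    budget_gb = gpu_vram_gb / (1 + margin) - overhead_gb
    return max(budget_gb, 0.0) * 8 / bits


for vram in (12, 16, 24, 32):
    print(f"{vram} GB at Q4: up to {max_params_billions(vram, 4):.1f}B parameters")
```

Because it reserves both overhead and the full safety margin, this is stricter than comparing weight size to VRAM directly; a 12 GB card comes out at roughly 14B parameters at Q4 under these assumptions.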

How Accurate Is the Formula?

The formula is accurate within ±10% for most cases. Real-world VRAM usage varies based on implementation, model architecture, and inference engine optimizations.

Sources of variation include: different quantization formats (GGUF vs. GPTQ vs. AWQ), model architecture (dense vs. mixture-of-experts), and inference engine-specific optimizations (vLLM, llama.cpp, Ollama).

As of April 2026, treat the formula as a baseline for model weights, not a worst case. Always add a 25% safety margin when purchasing GPUs to account for context overhead, batching, and system processes.

VRAM formula accuracy ±10%: variation caused by quantization format (GGUF vs GPTQ vs AWQ), model architecture (dense vs MoE), and inference engine (vLLM vs llama.cpp vs Ollama).

Common Mistakes in VRAM Calculation

  • Forgetting the context overhead. A 7B model at Q4 is 3.5 GB, but with 4k context, it needs 5-6 GB total.
  • Using model size from HuggingFace without considering quantization. 70B means 70 billion parameters, not 70 GB VRAM.
  • Not accounting for system overhead. Models never get the full GPU VRAM. Reserve 1-2 GB for the OS and inference engine.
  • Buying GPU exactly at calculated size. Always buy 25% more. A calculated 18 GB need means get a 24 GB GPU.
4 common VRAM mistakes: forgetting context overhead (adds 1.5–3 GB), confusing 70B parameters with 70 GB VRAM, ignoring 1–2 GB system overhead, and buying a GPU exactly at the calculated size without 25% margin.

Regional Deployment Considerations

European Union (GDPR): Local (on-premises) inference keeps user data on hardware you control, which simplifies data-residency and cross-border transfer questions under GDPR. This VRAM calculator helps you size hardware for privacy-first deployments.

Japan (APPI): The Act on Protection of Personal Information (APPI) requires careful data handling. On-device LLM inference reduces data transfer and processing outside Japan. Use this calculator to size systems for Japanese enterprise deployments.

China (Data Security Law): China's 2021 Data Security Law sets data localization requirements for regulated categories of data. Local LLM inference on domestic servers (Alibaba Cloud, Tencent Cloud) keeps processing within Chinese borders. This formula applies to sizing those deployments with Chinese-optimized models like Qwen2.5.

In all regions, local inference provides stronger data privacy guarantees than cloud APIs. This VRAM calculator is essential for designing compliant, privacy-preserving AI systems.

FAQ: VRAM and GPU Requirements

Does the formula work for all model types?

Yes. The formula (Model Billions × Quantization Bits) ÷ 8 applies to all Transformer-based open-weight models (Llama, Qwen, Mistral, etc.). For mixture-of-experts models, use the total parameter count, since all experts must be loaded. Non-Transformer architectures (RNNs, etc.) are rare and may require adjustment.

What quantization should I use?

For most use cases: Q5 offers the best balance (95% quality, 68% size reduction). For consumer GPUs: Q4 is standard (90-95% quality, 75% reduction). For production: Q8 if VRAM allows (99% quality). Avoid Q3 and below unless you have no choice.

How much system RAM do I need?

Minimum 16 GB for offloading. If using VRAM offloading (CPU spillover), system RAM becomes the fallback. For batch processing, add 8–16 GB system RAM beyond model offload requirements. For single-user chat, 16 GB is sufficient.

Does batch size affect VRAM calculation?

Yes. The formula calculates single-request VRAM. Batch size adds additional VRAM linearly: each concurrent request adds ~500 MB–2 GB depending on context length. If running batch=4, add 2–8 GB to the calculated amount.

Can I run a 70B model on a 12 GB GPU?

Only with extreme quantization (Q2, roughly 70% quality) plus CPU offloading, because 70B at Q2 still needs 17.5 GB (very slow, 1–3 tokens/sec). Not practical. Better option: use a 13B model at Q4 (6.5 GB, fits entirely in VRAM, much faster and better quality).

What if my actual VRAM usage is lower than calculated?

That is expected. The recommended figure adds overhead and a safety margin on top of the model weights, so it is deliberately conservative. Lower actual usage means more headroom for batch processing or longer contexts. Use nvidia-smi to measure real usage, then benchmark your model to confirm performance.
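
To compare an estimate with reality, nvidia-smi can report used and total memory in machine-readable form through its `--query-gpu` and `--format=csv,noheader,nounits` options. The sketch below wraps that call; `parse_memory_csv` and `read_gpu_memory` are hypothetical helper names, the sample string is made-up data for illustration, and the live query needs an NVIDIA driver with nvidia-smi on the PATH.

```python
import subprocess


def parse_memory_csv(text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' pairs in MiB, one GPU per line, from nvidia-smi CSV output."""
    pairs = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        pairs.append((used, total))
    return pairs


def read_gpu_memory() -> list[tuple[int, int]]:
    """Query every visible NVIDIA GPU; raises if nvidia-smi is missing or fails."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_csv(result.stdout)


# Parsing a made-up sample so the example runs without a GPU.
sample = "8123, 24564\n"
for index, (used, total) in enumerate(parse_memory_csv(sample)):
    print(f"GPU {index}: {used} MiB used of {total} MiB ({used / total:.0%})")
```

Call `read_gpu_memory()` while the model is loaded and a long prompt is in flight, since the KV cache grows as the context fills.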

A Note on Third-Party Facts

This article references third-party AI models, benchmarks, prices, and licenses. The AI landscape changes rapidly. Benchmark scores, license terms, model names, and API prices can shift between the time of writing and the time you read this. Before making deployment or compliance decisions based on this article, verify current figures on each provider's official source: Hugging Face model cards for licenses and benchmarks, provider websites for API pricing, and EUR-Lex for current GDPR and EU AI Act text. This article reflects publicly available information as of May 2026.
