Hardware & Performance

VRAM Calculator for Local LLMs: Calculate Exact GPU Requirements

8 min read · By Hans Kuepper, Founder of PromptQuorum, a multi-model dispatch tool

This guide explains how to calculate exact VRAM requirements for any model and hardware combination. The formula is simple: (Parameters in Billions × Quantization Bits) ÷ 8 = VRAM needed in GB. As of April 2026, understanding VRAM math prevents expensive hardware mistakes.

Key takeaways

  • VRAM (GB) = (Parameters in Billions × Quantization Bits) ÷ 8
  • FP16 = 16 bits, Q8 = 8, Q5 = 5, Q4 = 4 bits
  • Example: 13B model at Q4 = (13 × 4) ÷ 8 = 6.5 GB
  • Always add 25% buffer for context, system overhead, and safe margin
  • As of April 2026, this formula is accurate within ±10%

What Is the VRAM Formula?

The formula for VRAM requirement:

```
VRAM (GB) = (Parameters in Billions × Quantization Bits) ÷ 8
```

Examples:

  • 7B model at 4-bit quantization: (7 × 4) ÷ 8 = 3.5 GB
  • 13B model at 5-bit quantization: (13 × 5) ÷ 8 = 8.125 GB
  • 70B model at 8-bit quantization: (70 × 8) ÷ 8 = 70 GB

What Do Quantization Levels Mean?

| Quantization | Size Reduction | Quality | Speed | Use Case |
|---|---|---|---|---|
| FP16 (16-bit) | None (baseline) | 100% (perfect) | Baseline | Research, fine-tuning |
| Q8 (8-bit) | 50% | 99% (imperceptible) | Baseline | Production, local servers |
| Q6 (6-bit) | 62.5% | 98% (negligible) | Baseline | Balanced use |
| Q5 (5-bit) | 68.75% | 95% (minor loss) | Baseline | Good compression, consumer hardware |
| Q4 (4-bit) | 75% | 90–95% (acceptable) | Baseline | Maximum compression |
| Q3 (3-bit) | 81% | 80–85% (noticeable loss) | Faster | Extreme compression, CPU |
| Q2 (2-bit) | 87.5% | 70% (visible loss) | Fastest | Tiny models, edge devices |

Quick Reference Table: VRAM by Model and Quantization

| Model Size | FP16 (full precision) | Q8 (8-bit) | Q5 (5-bit) | Q4 (4-bit) |
|---|---|---|---|---|
| 7B | 14 GB | 7 GB | 4.375 GB | 3.5 GB |
| 13B | 26 GB | 13 GB | 8.125 GB | 6.5 GB |
| 70B | 140 GB | 70 GB | 43.75 GB | 35 GB |
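These quick-reference numbers follow directly from the formula; a short loop reproduces them (function name illustrative):

```python
def vram_gb(params_billions: float, quant_bits: int) -> float:
    """VRAM for model weights in GB: (billions of params * bits) / 8."""
    return params_billions * quant_bits / 8

# One row per model size, one column per quantization level
for size in (7, 13, 70):
    row = " | ".join(f"{vram_gb(size, bits):g} GB" for bits in (16, 8, 5, 4))
    print(f"{size}B: {row}")
```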

Real-World Examples

Practical VRAM calculations for common scenarios:

  • RTX 4070 Ti (12 GB): a 7B model at Q4 = 3.5 GB ✓ (plenty of room); a 13B model at Q5 = 8.1 GB ✓ (tight, but works).
  • RTX 4090 (24 GB): a 70B model at Q5 = 43.75 GB ✗ (too large); at Q4 = 35 GB ✗ (still too large); at Q4 with CPU offloading = works, but slowly.
  • M3 Max Mac (36 GB unified memory): a 13B model at FP16 = 26 GB ✓ (works); a 70B model does not fit — even at Q4, its 35 GB of weights leave no room for context and system overhead.

What Hidden VRAM Overhead Should You Account For?

The formula calculates model weights only. Additional VRAM is used by:

  • Context (key-value cache): Stores conversation history. A 4k-token context uses ~2–3 GB for 7B models.
  • Batch processing: Running multiple prompts uses extra VRAM. Each additional concurrent prompt uses ~500MB–2GB.
  • System overhead: Operating system and inference engine overhead: ~500MB–1GB.
  • Safety margin: Always budget 25% extra VRAM.
  • Total overhead: 25–40% of model size.
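Putting the overhead items above together, a conservative budgeting sketch (the default percentages mirror the list above; the helper name and defaults are illustrative assumptions):

```python
def total_vram_gb(weights_gb: float, context_gb: float = 2.5,
                  system_gb: float = 1.0, safety_margin: float = 0.25) -> float:
    """Budget total VRAM: weights + KV cache + system overhead,
    then a 25% safety margin on top (assumed defaults)."""
    return (weights_gb + context_gb + system_gb) * (1 + safety_margin)

# 7B at Q4: 3.5 GB of weights, ~2.5 GB KV cache, ~1 GB system overhead
print(round(total_vram_gb(3.5), 2))  # 8.75
```

Applying the margin to the full sum rather than to the weights alone is a deliberate, conservative choice; the formula section's raw number (3.5 GB) is only the floor.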

How Accurate Is the Formula?

The formula is accurate within ±10% for most cases. Variations occur from:

  • Different quantization implementations (GGUF vs. safetensors vs. AWQ)
  • Model architecture (Transformer vs. non-Transformer variants)
  • Inference engine optimizations (vLLM, llama.cpp, etc.)

As of April 2026, use the formula as a conservative estimate and add a 25% safety margin.

Common Mistakes in VRAM Calculation

  • Forgetting the context overhead. A 7B model at Q4 is 3.5 GB, but with 4k context, it needs 5–6 GB total.
  • Using model size from HuggingFace without considering quantization. 70B means 70 billion parameters, not 70 GB VRAM.
  • Not accounting for system overhead. Models never get the full GPU VRAM. Reserve 1–2 GB for the OS and inference engine.
  • Buying a GPU sized exactly to the calculated requirement. Always buy 25% more: a calculated 18 GB need means getting a 24 GB GPU.
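The sizing rules above reduce to a simple fit check: compute the weight size, add the 25% buffer, and compare against the card's VRAM (function name is illustrative):

```python
def fits(gpu_vram_gb: float, params_billions: float, quant_bits: int,
         buffer: float = 0.25) -> bool:
    """Check whether a GPU can hold the model weights plus a 25% buffer."""
    weights_gb = params_billions * quant_bits / 8
    return weights_gb * (1 + buffer) <= gpu_vram_gb

print(fits(24, 13, 5))  # True: 8.125 GB * 1.25 ≈ 10.2 GB fits a 24 GB card
print(fits(24, 70, 4))  # False: 35 GB * 1.25 = 43.75 GB exceeds 24 GB
```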

Sources

  • GGUF Specification — github.com/ggerganov/ggml/blob/master/docs/gguf.md
  • Transformers Quantization — huggingface.co/docs/transformers/quantization

Compare your local LLM against 25+ cloud models simultaneously with PromptQuorum.

Try PromptQuorum for free →

