Key Takeaways
- VRAM = (Model Size × Quantization Bits) ÷ 8
- FP16 = 16 bits, Q8 = 8, Q5 = 5, Q4 = 4 bits
- Example: 13B model at Q4 = (13 × 4) ÷ 8 = 6.5 GB
- Always add a 25% buffer for context, system overhead, and a safe margin
- As of April 2026, this formula is accurate within ±10%
What Is the VRAM Formula?
The formula for VRAM requirement:
VRAM (GB) = (Model Size in Billions × Quantization Bits) ÷ 8
Examples:
- 7B model at 4-bit quantization
- (7 × 4) ÷ 8 = 3.5 GB
- 13B model at 5-bit quantization
- (13 × 5) ÷ 8 = 8.125 GB
- 70B model at 8-bit quantization
- (70 × 8) ÷ 8 = 70 GB

What Do Quantization Levels Mean?
| Quantization | Size Reduction | Quality | Speed | Use Case |
|---|---|---|---|---|
| FP16 (16-bit) | None (baseline) | 100% (perfect) | Baseline | Research, fine-tuning |
| Q8 (8-bit) | 50% | 99% (imperceptible) | Baseline | Production, local servers |
| Q6 (6-bit) | 62.5% | 98% (negligible) | Baseline | Balanced use |
| Q5 (5-bit) | 68.75% | 95% (minor loss) | Baseline | Good compression, consumer |
| Q4 (4-bit) | 75% | 90–95% (acceptable) | Baseline | Maximum compression |
| Q3 (3-bit) | 81% | 80–85% (noticeable loss) | Faster | Extreme compression, CPU |
| Q2 (2-bit) | 87.5% | 70% (visible loss) | Fastest | Tiny models, edge devices |
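The size-reduction column can be checked directly: each level's reduction relative to the FP16 baseline is 1 − bits/16. A minimal sketch (the bit-width mapping mirrors the table above):

```python
# Size reduction relative to the FP16 (16-bit) baseline: 1 - bits/16.
QUANT_BITS = {"FP16": 16, "Q8": 8, "Q6": 6, "Q5": 5, "Q4": 4, "Q3": 3, "Q2": 2}

for name, bits in QUANT_BITS.items():
    reduction = (1 - bits / 16) * 100
    print(f"{name}: {reduction:.2f}% smaller than FP16")
```

Note that Q3's exact reduction is 81.25%, which the table rounds to 81%.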
Quick Reference Table: VRAM by Model and Quantization
| Model Size | FP16 (full precision) | Q8 (8-bit) | Q5 (5-bit) | Q4 (4-bit) |
|---|---|---|---|---|
| 3B | 6 GB | 3 GB | 1.9 GB | 1.5 GB |
| 7B | 14 GB | 7 GB | 4.4 GB | 3.5 GB |
| 13B | 26 GB | 13 GB | 8.1 GB | 6.5 GB |
| 34B | 68 GB | 34 GB | 21.3 GB | 17 GB |
| 70B | 140 GB | 70 GB | 43.8 GB | 35 GB |
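The table values follow mechanically from the formula; a small helper (the function name is illustrative) reproduces them:

```python
def vram_gb(params_billions: float, bits: int) -> float:
    """VRAM estimate per the article's formula: (size in billions × bits) ÷ 8."""
    return params_billions * bits / 8

# Reproduce a few table cells:
print(vram_gb(7, 4))    # 3.5
print(vram_gb(13, 5))   # 8.125
print(vram_gb(70, 8))   # 70.0
```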
Real-World Examples
Practical VRAM calculations for common scenarios:
- RTX 4070 Ti (12 GB): Llama 3.1 7B at Q4 = 3.5 GB ✓ (plenty of room). Llama 3.1 13B at Q5 = 8.1 GB ✓ (tight, but works).
- RTX 4090 (24 GB): Llama 3.1 70B at Q5 = 43.75 GB ✗ (too large). Llama 3.1 70B at Q4 = 35 GB ✗ (still too large). Llama 3.1 70B at Q4 with offloading = works.
- M3 Max Mac (36 GB): Llama 3.1 13B at FP16 = 26 GB ✓ (works). Llama 3.1 70B = impossible (even at Q4).
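Each of these go/no-go calls reduces to comparing the formula's output against available VRAM. A sketch, assuming a flat 1.5 GB reserve for OS and inference-engine overhead (that reserve value is an assumption, within the 1–2 GB range mentioned later):

```python
def fits(model_billions: float, bits: int, gpu_vram_gb: float,
         reserve_gb: float = 1.5) -> bool:
    """True if the quantized weights plus a fixed system reserve fit in VRAM."""
    need = model_billions * bits / 8 + reserve_gb  # weights + OS/engine overhead
    return need <= gpu_vram_gb

print(fits(7, 4, 12))    # True  (RTX 4070 Ti class)
print(fits(70, 4, 24))   # False (RTX 4090, no offloading)
```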
How Accurate Is the Formula?
The formula is accurate within ±10% for most cases. Variations occur from:
- Different quantization implementations (GGUF vs. safetensors vs. AWQ)
- Model architecture (Transformer vs. non-Transformer)
- Inference engine optimizations (vLLM, llama.cpp, etc.)
As of April 2026, use the formula as a conservative estimate and add a 25% safety margin.
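The ±10% band can be expressed as a range around the formula's estimate; a quick sketch:

```python
def vram_range_gb(params_billions: float, bits: int, tolerance: float = 0.10):
    """Formula estimate bracketed by the article's ±10% accuracy band."""
    estimate = params_billions * bits / 8
    return estimate * (1 - tolerance), estimate * (1 + tolerance)

low, high = vram_range_gb(13, 4)
print(round(low, 2), round(high, 2))  # 5.85 7.15
```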
Common Mistakes in VRAM Calculation
- Forgetting the context overhead. A 7B model at Q4 is 3.5 GB, but with 4k context, it needs 5β6 GB total.
- Using model size from HuggingFace without considering quantization. 70B means 70 billion parameters, not 70 GB VRAM.
- Not accounting for system overhead. Models never get the full GPU VRAM. Reserve 1β2 GB for the OS and inference engine.
- Buying a GPU sized exactly at the calculated requirement. Always buy 25% more: if the calculation says 18 GB, get a 24 GB GPU.
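These pitfalls combine into one sizing rule of thumb. The sketch below adds context and system reserves before applying the 25% buffer; the 1.5 GB default values are illustrative assumptions, not measurements:

```python
def recommended_gpu_gb(model_billions: float, bits: int,
                       context_gb: float = 1.5,   # assumption: KV cache for a ~4k context
                       reserve_gb: float = 1.5,   # assumption: OS + inference engine
                       buffer: float = 0.25) -> float:
    """Weights per the formula, plus context and system reserves, plus a 25% margin."""
    need = model_billions * bits / 8 + context_gb + reserve_gb
    return need * (1 + buffer)

print(recommended_gpu_gb(7, 4))  # 8.125
```

With these defaults, a 7B model at Q4 lands around 8 GB of recommended VRAM, consistent with the 5–6 GB working footprint plus margin described above.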
Sources
- GGUF Specification β github.com/ggerganov/ggml/blob/master/docs/gguf.md
- Transformers Quantization β huggingface.co/docs/transformers/quantization