Key Takeaways
- Q4 (4-bit): 87.5% VRAM savings vs FP32, ~1% quality loss. Use this for everything.
- Q5 (5-bit): ~84% VRAM savings, ~0.5% quality loss. Rarely necessary; Q4 and Q8 bracket it.
- Q8 (8-bit): 75% VRAM savings, <0.1% quality loss. For perfectionists with VRAM to spare.
- FP32 (32-bit): Full precision, 0% loss, 0% savings. Impractical; skip it.
- Speed: quantization carries no speed penalty (inference is memory-bound, not compute-bound).
- VRAM usage (70B Llama model, weights only): FP32=280GB, Q8=70GB, Q5=~44GB, Q4=35GB.
- Recommendation: Use Q4 for 7B–70B models. Use Q8 only if you have 32GB+ VRAM and need pristine quality.
- Q5 sees little use because the same VRAM budget usually runs a Q4 of a larger, better model.
Quantization Levels at a Glance
| Level | Bits/Parameter | VRAM (70B) | Quality vs FP32 | Speed | Use Case |
|---|---|---|---|---|---|
| Q2 | 2 | ~18GB | Severe loss | No penalty | Last resort when Q3 won't fit |
| Q3 | 3 | ~26GB | −3.7 pts MMLU | No penalty | Tight VRAM budgets |
| Q4 | 4 | ~35GB | −1.2 pts MMLU | No penalty | Default for everything |
| Q5 | 5 | ~44GB | −0.5 pts MMLU | No penalty | Rarely worth it over Q4 |
| Q8 | 8 | 70GB | −0.1 pts MMLU | No penalty | Maximum fidelity, ample VRAM |
| FP32 | 32 | 280GB | Baseline | No penalty | Impractical |
VRAM & Performance Impact
VRAM calculation: parameter count × bytes per parameter (bits ÷ 8), plus overhead for the KV cache and activations.
Llama 3 70B:
- FP32: 70B × 4 bytes = 280GB (impractical)
- Q8: 70B × 1 byte = 70GB (multi-GPU territory)
- Q4: 70B × 0.5 bytes = 35GB (two 24GB cards or one 48GB card, plus overhead)
Speed: inference is memory-bound (waiting on VRAM reads), not compute-bound.
On the same hardware, tokens/sec changes little from Q2 to FP32 once the model fits.
Memory bandwidth, not computation, is the bottleneck: quantization's main win is VRAM, and because smaller weights mean fewer bytes read per token, a quantized model never runs slower.
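The two rules of thumb above (weights × bits ÷ 8 for VRAM; bandwidth ÷ model size for a decode ceiling) can be checked in a few lines of Python. The 1008 GB/s figure below is an assumed bandwidth, roughly an RTX 4090, not a measured value:

```python
def model_bytes_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (ignores KV cache and overhead)."""
    return params_billions * bits_per_weight / 8.0

def decode_ceiling_tok_s(params_billions: float, bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Single-stream decode reads every weight once per token, so
    bandwidth / model size gives a rough throughput ceiling."""
    return bandwidth_gb_s / model_bytes_gb(params_billions, bits_per_weight)

print(model_bytes_gb(70, 4))              # 35.0 -> a 70B model at Q4
print(decode_ceiling_tok_s(8, 4, 1008))   # 252.0 -> 8B Q4 on ~1TB/s bandwidth
```

Note how the ceiling scales with quantized size: halving bits per weight doubles the theoretical decode rate, which is why quantization never costs speed on memory-bound hardware.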
Quality Loss: Objective Benchmarks
Measured on MMLU benchmark (general knowledge, 57 tasks):
- Llama 3 70B FP32 baseline: 85.2% accuracy.
- Llama 3 70B Q8: 85.1% accuracy (-0.1% loss).
- Llama 3 70B Q5: 84.7% accuracy (-0.5% loss).
- Llama 3 70B Q4: 84.0% accuracy (-1.2% loss).
- Llama 3 70B Q3: 81.5% accuracy (-3.7% loss).
- Real-world impact: Q4 vs Q8 costs roughly one extra wrong answer per 100 questions.
- For chat and writing: imperceptible. For STEM problem-solving: Q8 is the safer pick.
When to Use Each Level
Q4: Default. Use for all models. Sweet spot of compression + quality.
Q5: Rarely worth it. If you need Q5's quality, run Q4 on a slightly larger model. If you have Q5's VRAM (~44GB for a 70B), that budget already covers Q4 of the same 70B with headroom for context.
Q8: Only if you have 32GB+ VRAM AND model is <70B AND you need perfect accuracy (research, medical use).
Q3: Budget squeeze. 3% quality loss acceptable? Use Q3. Otherwise, upgrade GPU or use smaller model.
Q2: Desperation. Quality loss too high for most. Use only if OOM on Q3.
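The decision rules above condense into a small helper. The function name and thresholds are illustrative (weights-only arithmetic, no KV-cache headroom), not a real library API:

```python
def pick_quant(vram_gb: float, params_billions: float,
               needs_max_accuracy: bool = False) -> str:
    """Choose a quantization level following the rules above (illustrative)."""
    def fits(bits: int) -> bool:
        # Weights-only estimate; real deployments need headroom for KV cache.
        return params_billions * bits / 8.0 <= vram_gb

    if needs_max_accuracy and fits(8):
        return "Q8"          # maximum fidelity, if it fits
    if fits(4):
        return "Q4"          # the default sweet spot; Q5 is skipped on purpose
    if fits(3):
        return "Q3"          # budget squeeze, ~3-4 point MMLU drop
    if fits(2):
        return "Q2"          # last resort
    return "too large: use a smaller model or more VRAM"

print(pick_quant(24, 8))     # Q4 -- an 8B model fits a 24GB card easily
print(pick_quant(24, 70))    # Q2 -- 70B at Q2 is 17.5GB, the only fit at 24GB
```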
Q4 Deep Dive: Why It's the Standard
Q4 is optimal because:
1. 87.5% VRAM savings (best ratio).
2. <1.2% quality loss (imperceptible to users).
3. No speed penalty (memory-bound, not compute-bound).
4. Fits accessible hardware (a 70B in ~35GB: two 24GB consumer GPUs or one 48GB card).
5. De facto industry standard (Ollama defaults to Q4 variants; Q4 files dominate downloads on Hugging Face).
Most open-weight models released since 2024 ship with, or quickly receive, a Q4 variant for production use.
If a release offers only FP32/Q8/Q5, treat it as not yet packaged for production.
Common Misconceptions
- Q4 sounds "low quality" because 4 bits seems small. False: the ~1% quality loss is imperceptible in practice.
- Quantization makes inference slower. False: inference is memory-bound, so there is no speed penalty.
- "I should use Q8 to be safe." Unnecessary: Q4 is proven, safe, and standard; Q8 mostly wastes VRAM.
- "I need FP32 for accuracy." False: Q8 is sufficient even for research-grade work.
FAQ
Should I use Q4 or Q8 for coding?
Q4. There is no speed penalty, and the ~1% quality gap is imperceptible for code generation.
Can I use Q3 if I'm tight on VRAM?
Yes. 3% quality loss is acceptable for chat/creative writing. Unacceptable for reasoning/math.
Is there a Q6 or Q7?
Partially. llama.cpp's GGUF format includes a Q6_K type that sits between Q5 and Q8, but Q4/Q5/Q8 remain the common reference points; there is no standard Q7.
Which quantization is fastest?
Effectively tied. Inference is memory-bound, so speed tracks bytes read per token; lower-bit quants such as Q2 can be marginally faster, but the difference is small in practice.
Can I dequantize Q4 back to FP32?
No, the rounding discards information. Casting Q4 weights back to FP32 reproduces only the quantized values, never the originals. Quantization is one-way.
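A toy example makes the one-way nature concrete. This is simple symmetric per-tensor rounding, not the per-block scheme real Q4 formats (e.g., GGUF's Q4_K) use, but the information loss is the same in kind:

```python
def quantize_4bit(weights):
    """Toy symmetric 4-bit quantization: snap each float to an int in -7..7."""
    scale = max(abs(w) for w in weights) / 7
    return [max(-7, min(7, round(w / scale))) for w in weights], scale

def dequantize(q, scale):
    """'Dequantizing' only multiplies back; the rounding error is gone for good."""
    return [qi * scale for qi in q]

w = [0.12, -0.53, 0.91, 0.07, -0.88]
q, s = quantize_4bit(w)
restored = dequantize(q, s)
# restored approximates w, but no upcast to FP32 recovers the original values
err = max(abs(a - b) for a, b in zip(w, restored))
```

After quantization only 15 distinct values per scale group remain, so `err` is nonzero and irreducible: that residual is exactly the ~1% benchmark loss discussed above.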
Should I quantize my fine-tuned model?
Yes, after training. Quantize the trained weights to Q4 for deployment.
Sources
- MMLU benchmark: quantization impact on reasoning tasks (OpenAI Evals)
- Llama model card: accuracy across quantization levels
- Quantization research: "Towards Quantization-Aware Deep Neural Networks" (arXiv 2024)