Q4 vs Q5 vs Q8: Which Quantization Level Should You Use?

8 min · By Hans Kuepper · Founder of PromptQuorum, multi-model AI dispatch tool · PromptQuorum

Q4 (4-bit) is the sweet spot: 87.5% VRAM savings versus FP32 with imperceptible quality loss. As of April 2026, Q5 is pointless (only about 0.7 MMLU points better than Q4, for 25% more VRAM), and Q8 is for perfectionists with excess VRAM. FP32 (full precision) is never necessary for inference on consumer hardware.

Key Takeaways

  • Q4 (4-bit): 87.5% VRAM savings, ~1% quality loss. Use this for everything.
  • Q5 (5-bit): 84% VRAM savings, ~0.5% quality loss. Never necessary; Q4 + Q8 bracket Q5.
  • Q8 (8-bit): 75% VRAM savings vs FP32 (50% vs FP16), ~0.1% quality loss. For perfectionists with excess VRAM.
  • FP32 (32-bit): Full precision, 0% loss, 0% savings. Impractical; skip it.
  • Speed: Generation is memory-bound, not compute-bound, so lower-bit quantizations carry no speed penalty and are often slightly faster.
  • VRAM usage (70B Llama model, weights only): FP32=280GB, Q8=70GB, Q5≈44GB, Q4=35GB.
  • Recommendation: Use Q4 for 7B-70B. Use Q8 only if you have 32GB+ VRAM and need pristine quality.
  • Q5 loses out because, at the same VRAM budget, Q4 on a slightly larger model beats Q5 on a smaller one.

Quick Facts

  • Q4 VRAM savings: 87.5% vs FP32 (about 35GB for Llama 3 70B)
  • Q4 quality loss: about 1.2 points on the MMLU benchmark
  • Q8 VRAM savings: 75% vs FP32 (about 70GB for Llama 3 70B)
  • Speed difference: no penalty for lower-bit levels; generation is memory-bound, so smaller weights are often slightly faster
  • Q5 verdict: Dead zone. Q4 on a larger model gives a better result at the same VRAM

Quantization Levels Compared: Q2 Through Q8

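| Quantization | RAM Usage | Speed | Quality | Best For |
| --- | --- | --- | --- | --- |
| Q2 | Very Low | Very Fast | Poor | Experiments |
| Q3 | Low | Fast | Low | Small devices |
| Q4 | Medium | Fast | Good | Most users |
| Q5 | Medium+ | Medium | Very Good | Coding |
| Q6 | High | Slower | Excellent | Accuracy focus |
| Q8 | Very High | Slow | Near FP16 | Benchmarking |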
VRAM savings by quantization level (70B model, versus FP32): FP32 = 280GB, Q8 = 70GB (75% savings), Q4 = 35GB (87.5% savings), Q3 ≈ 26GB (about 91% savings). Q4 is the sweet spot for most users.

Best Quantization Level by Use Case

  • 8GB RAM: Q3 or Q4 (small 7B models only)
  • 16GB RAM: Q4_K_M (recommended for most laptops)
  • 32GB RAM: Q5, Q6, or Q8 (larger models, higher quality)
  • Maximum accuracy: Q8 (when VRAM is not a constraint)
Hardware selection guide: 8GB RAM → Q3/Q4 (7B models), 16GB → Q4_K_M (recommended), 32GB+ → Q5/Q6/Q8 (larger models, higher quality), 64GB+ → Q8 or FP32 (research/medical).
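
The hardware guide above can be read as a simple lookup. The sketch below is one possible encoding of it; the function name `suggest_levels` is hypothetical, treating each tier as a lower bound (so 24GB falls into the 16GB tier) is an assumption, and so is returning an empty list below 8GB, which the guide does not cover.

```python
def suggest_levels(ram_gb: int) -> list[str]:
    """Map available RAM (GB) to the quantization levels listed in the guide.

    Assumption: each tier applies from its stated size up to the next tier.
    """
    if ram_gb >= 64:
        return ["Q8", "FP32"]        # research/medical tier
    if ram_gb >= 32:
        return ["Q5", "Q6", "Q8"]    # larger models, higher quality
    if ram_gb >= 16:
        return ["Q4_K_M"]            # recommended for most laptops
    if ram_gb >= 8:
        return ["Q3", "Q4"]          # small 7B models only
    return []                        # not covered by the guide


if __name__ == "__main__":
    for ram in (8, 16, 32, 64):
        print(f"{ram}GB -> {', '.join(suggest_levels(ram))}")
```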

How Does Quantization Affect VRAM and Speed?

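VRAM calculation: parameter count × bytes per weight (weights only, before KV cache and runtime overhead).

Llama 3 70B:

- FP32: 70B × 4 bytes = 280GB (impractical)

- Q8: 70B × 1 byte = 70GB (needs multiple GPUs or a large unified-memory machine)

- Q4: 70B × 0.5 bytes = 35GB (exceeds a single 24GB RTX 4090; needs two such cards or partial CPU offload)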
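
As a rough check on this arithmetic, the calculation can be sketched as parameter count times bytes per weight. The bits-per-weight table below uses nominal values (an assumption: real GGUF files store extra scale data, so actual files run somewhat larger), and the function names are hypothetical.

```python
# Nominal bits per weight; an assumption that ignores per-block scale overhead.
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "Q8": 8, "Q5": 5, "Q4": 4, "Q3": 3}


def weight_memory_gb(params_billions: float, level: str) -> float:
    """Nominal weight memory in GB (1 GB = 1e9 bytes), excluding KV cache and overhead."""
    bytes_per_weight = BITS_PER_WEIGHT[level] / 8
    # billions of parameters * bytes per parameter = billions of bytes = GB
    return params_billions * bytes_per_weight


def savings_vs_fp32(level: str) -> float:
    """Fraction of weight memory saved relative to FP32."""
    return 1 - BITS_PER_WEIGHT[level] / BITS_PER_WEIGHT["FP32"]


if __name__ == "__main__":
    for level in BITS_PER_WEIGHT:
        size = weight_memory_gb(70, level)
        print(f"{level}: {size:.1f} GB, {savings_vs_fp32(level):.1%} saved vs FP32")
```

Under these assumptions a 70B model comes out at 280GB for FP32 and 35GB for Q4, which is the 87.5% saving quoted for Q4.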

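Speed: Token generation is memory-bound (waiting on memory reads), not compute-bound.

Lower-bit levels therefore carry no tokens/sec penalty on the same hardware; with fewer bytes to read per token they are often somewhat faster.

Memory bandwidth, not computation, is the bottleneck. The main reason to quantize is VRAM, with speed as a side benefit.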
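
One way to reason about the memory-bound regime is a back-of-the-envelope ceiling: if every generated token requires reading all weights once, tokens/sec cannot exceed memory bandwidth divided by weight size. The sketch below is a simplification under that assumption; it ignores KV cache reads, compute time, and kernel overhead, and the 1000 GB/s bandwidth figure is a hypothetical round number, not a measurement.

```python
def decode_tokens_per_sec_ceiling(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/sec when each token needs one full pass over the weights."""
    return bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    bandwidth = 1000.0  # GB/s, hypothetical value for illustration
    for name, size_gb in (("Q8", 70.0), ("Q4", 35.0)):
        ceiling = decode_tokens_per_sec_ceiling(size_gb, bandwidth)
        print(f"70B {name} ({size_gb:.0f} GB): at most ~{ceiling:.1f} tokens/sec")
```

In this simple model, halving the bytes per weight doubles the ceiling. Measured throughput depends on the runtime and hardware, so treat the output as a bound rather than a prediction.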

Quality Loss by Level: MMLU Benchmark Results

Measured on MMLU benchmark (general knowledge, 57 tasks):

  • Llama 3 70B FP32 baseline: 85.2% accuracy.
  • Llama 3 70B Q8: 85.1% accuracy (0.1 points lower).
  • Llama 3 70B Q5: 84.7% accuracy (0.5 points lower).
  • Llama 3 70B Q4: 84.0% accuracy (1.2 points lower).
  • Llama 3 70B Q3: 81.5% accuracy (3.7 points lower).
  • Real-world impact: Q4 vs Q8 = about 1 fewer correct answer per 100 questions.
  • For chat/writing: imperceptible difference. For STEM problems: Q8 safer.
Quality loss benchmarks on MMLU: Q8 = 0.1 points, Q5 = 0.5 points, Q4 = 1.2 points, Q3 = 3.7 points. Q4 quality loss is imperceptible for most tasks.
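
The loss figures follow directly from the accuracy scores listed above. The short sketch below recomputes them from those scores and also shows the loss relative to the FP32 baseline, which is a slightly different number from the percentage-point gap. The function names are hypothetical.

```python
# MMLU accuracy (%) for Llama 3 70B, copied from the list above.
MMLU = {"FP32": 85.2, "Q8": 85.1, "Q5": 84.7, "Q4": 84.0, "Q3": 81.5}


def loss_points(level: str) -> float:
    """Accuracy drop in percentage points versus the FP32 baseline."""
    return round(MMLU["FP32"] - MMLU[level], 1)


def relative_loss_pct(level: str) -> float:
    """Accuracy drop as a percentage of the FP32 baseline score."""
    return 100 * (MMLU["FP32"] - MMLU[level]) / MMLU["FP32"]


if __name__ == "__main__":
    for level in ("Q8", "Q5", "Q4", "Q3"):
        print(f"{level}: -{loss_points(level)} points "
              f"({relative_loss_pct(level):.2f}% relative to FP32)")
```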

When to Use Each Level?

Q4: Default. Use for all models. Sweet spot of compression + quality.

Q5: Never. Wasteful. If you need Q5 quality, use Q4 with a slightly larger model. If you have enough VRAM for Q5 on a 70B model (about 44GB nominal), run that 70B at Q4 instead and keep the difference as headroom.

Q8: Only if you have 32GB+ VRAM AND model is <70B AND you need perfect accuracy (research, medical use).

Q3: Budget squeeze. If a loss of about 3.7 MMLU points is acceptable, use Q3. Otherwise, upgrade the GPU or use a smaller model.

Q2: Desperation. Quality loss too high for most. Use only if OOM on Q3.
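
The fallback order described above (Q4 by default, Q3 when VRAM is tight, Q2 only when Q3 runs out of memory) can be sketched as a small helper. The nominal bits-per-weight sizing and the `overhead_gb` allowance are assumptions for illustration, not figures from a specific runtime, and `pick_level` is a hypothetical name.

```python
from typing import Optional

# Nominal bits per weight (assumption: ignores per-block scale overhead).
BITS = {"Q4": 4, "Q3": 3, "Q2": 2}
# Preference order from the text: Q4 first, then Q3, then Q2 as a last resort.
PREFERENCE = ["Q4", "Q3", "Q2"]


def pick_level(params_billions: float, vram_gb: float,
               overhead_gb: float = 2.0) -> Optional[str]:
    """Return the first preferred level whose weights plus overhead fit in VRAM."""
    for level in PREFERENCE:
        needed_gb = params_billions * BITS[level] / 8 + overhead_gb
        if needed_gb <= vram_gb:
            return level
    return None  # nothing fits: use a smaller model or more VRAM


if __name__ == "__main__":
    for params, vram in ((7, 8), (13, 8), (70, 24), (70, 16)):
        print(f"{params}B on {vram}GB -> {pick_level(params, vram)}")
```

Q8 is left out of the preference list on purpose, since the text treats it as an opt-in for users with spare VRAM rather than a fallback.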

Why Is Q4 the Industry Standard?

Q4 is optimal because:

1. 87.5% VRAM savings (best ratio).

2. <1.2% quality loss (imperceptible to users).

3. No speed penalty (memory-bound, not compute-bound).

4. Fits consumer hardware (models up to roughly 30B on a single 24GB RTX 4090; a 70B at about 35GB needs two such cards or partial CPU offload).

5. Industry standard (HuggingFace, Ollama default to Q4).

Most widely used open-weight models released since 2024 have a Q4 variant available for production use.

If a model only has FP32/Q8/Q5, the project is not production-ready.

Common Misconceptions

  • Q4 sounds "low quality" because 4-bit seems small. False. 1% quality loss is imperceptible.
  • Quantization makes inference slower. False. Generation is memory-bound, not compute-bound, so a smaller model is no slower and is often slightly faster.
  • I should use Q8 to be safe. False. Q4 is proven, safe, and standard. Q8 is wasteful.
  • I need FP32 for accuracy. False. Never true. Q8 is sufficient even for research.

FAQ

What is LLM quantization?

Quantization compresses a model by reducing numerical precision, lowering memory usage and increasing speed.

What is the best quantization level?

Q4_K_M is the best default for most users, balancing performance and quality.

Does quantization reduce accuracy?

Yes, but Q4–Q5 retain most model quality while significantly reducing memory requirements.

Is Q8 worth it?

Only if you need maximum accuracy and have enough RAM. Most users will not benefit from Q8.

Should I use Q4 or Q8 for coding?

Q4. It is no slower than Q8, and the quality difference is about 1 MMLU point, which is rarely noticeable in code generation.

Can I use Q3 if I'm tight on VRAM?

Yes. A loss of about 3.7 MMLU points is acceptable for chat and creative writing, but not for reasoning or math.

Is there a Q6 or Q7?

Q6 yes, Q7 no. GGUF includes a Q6 level (Q6_K), which appears in the comparison table above, but there is no standard Q7. Q4, Q5, and Q8 remain the most commonly used levels.

Which quantization is fastest?

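No level is slower than a higher-bit one, because generation is memory-bound. Q2 moves the least data per token and is typically the fastest, but the size of the gap depends on hardware and runtime, and its quality loss rarely justifies the gain.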

Can I dequantize Q4 back to FP32?

Not losslessly. Q4 weights can be converted back to FP32 values, but the rounding error remains, so the original weights are not restored. Quantization is one-way.
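
A toy round trip makes the information loss concrete. The sketch below uses a simple symmetric 4-bit scheme with one shared scale, which is an assumption chosen for brevity and not the block-wise method GGUF K-quants use; the weight values are made up for illustration.

```python
def quantize_4bit(values):
    """Symmetric 4-bit quantization: map floats to integer codes in [-7, 7]."""
    scale = max(abs(v) for v in values) / 7 or 1.0  # fall back to 1.0 if all zeros
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to floats; the rounding error is not recoverable."""
    return [c * scale for c in codes]


# Made-up weights for illustration.
weights = [0.12, -0.53, 0.31, 0.90, -0.07, 0.44]
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

if __name__ == "__main__":
    print("codes:   ", codes)
    print("restored:", [round(r, 4) for r in restored])
    print("max abs error:", round(max_error, 4))
```

The restored values land on a grid of 15 possible levels, so different original weights that rounded to the same code become indistinguishable after dequantization.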

Should I quantize my fine-tuned model?

Yes, after training. Quantize the trained weights to Q4 for deployment.

What does GGUF Q4_K_M mean?

Q4_K_M is a refined Q4 variant using K-quants (mixed precision). The K algorithm preserves more accuracy on attention layers. Q4_K_M is the recommended download on HuggingFace for most models: effectively Q4 with ~0.3% better accuracy at the same VRAM cost.

Does quantization affect context length?

No. Quantization compresses model weights, not the context window. A Q4 model has the same maximum context length (e.g., 128k tokens) as its FP32 counterpart. Context memory (KV cache) is a separate concern from quantization.
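
To see why context memory is a separate budget, here is a minimal sketch of the usual KV cache size estimate: two tensors (keys and values) per layer, per KV head, per token. The architecture numbers in the example are hypothetical and chosen only for illustration, and the 2 bytes per value assumes a 16-bit cache, which does not depend on the weight quantization level.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GB (1 GB = 1e9 bytes) for one sequence."""
    # Factor 2: one key vector and one value vector per layer, per KV head, per token.
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1e9


# Hypothetical architecture values, not taken from the article.
example_gb = kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128,
                         context_tokens=128_000)

if __name__ == "__main__":
    print(f"KV cache at 128k tokens: {example_gb:.1f} GB")
```

Under these assumptions the cache grows linearly with context length, so a long context can need memory on the same order as the Q4 weights themselves.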

A Note on Third-Party Facts

This article references third-party AI models, benchmarks, prices, and licenses. The AI landscape changes rapidly. Benchmark scores, license terms, model names, and API prices can shift between the time of writing and the time you read this. Before making deployment or compliance decisions based on this article, verify current figures on each provider's official source: Hugging Face model cards for licenses and benchmarks, provider websites for API pricing, and EUR-Lex for current GDPR and EU AI Act text. This article reflects publicly available information as of May 2026.
