
Q4 vs Q5 vs Q8: Which Quantization Level Should You Use?

8 min · Hans Kuepper · Founder of PromptQuorum, a multi-model AI orchestration tool

Q4 (4-bit) is the sweet spot: 87.5% VRAM savings vs full precision with near-imperceptible quality loss. As of April 2026, Q5 is rarely worth it (only ~0.5% better quality than Q4 at a higher VRAM cost), and Q8 is for perfectionists with VRAM to spare. FP32 (full precision) is never necessary for inference on consumer hardware.

Key Takeaways

  • Q4 (4-bit): 87.5% VRAM savings vs FP32, ~1% quality loss. Use this for almost everything.
  • Q5 (5-bit): ~84% VRAM savings, ~0.5% quality loss. Rarely necessary; Q4 and Q8 bracket it.
  • Q8 (8-bit): 75% VRAM savings, ~0.1% quality loss. For perfectionists with VRAM to spare.
  • FP32 (32-bit): full precision, 0% loss, 0% savings. Impractical for inference; skip it.
  • Speed: all quantization levels run at similar tokens/sec (inference is memory-bound, not compute-bound).
  • VRAM usage (70B Llama weights): FP32 = 280GB, Q8 = ~70GB, Q5 = ~44GB, Q4 = ~35GB.
  • Recommendation: use Q4 for 7B–70B models. Use Q8 only when you have VRAM to spare and need pristine quality.
  • Few people use Q5: for Q5's VRAM budget, Q4 on a slightly larger model usually wins.

Quantization Levels at a Glance

| Level | Bits/Parameter | VRAM (70B) | Quality vs FP32 | Speed | Use Case |
|-------|----------------|------------|-----------------|-------|----------|
| FP32  | 32 | 280GB | baseline | comparable | never needed for inference |
| Q8    | 8  | ~70GB | −0.1 pts | comparable | accuracy-critical, VRAM to spare |
| Q5    | 5  | ~44GB | −0.5 pts | comparable | rarely worth it |
| Q4    | 4  | ~35GB | −1.2 pts | comparable | default |
| Q3    | 3  | ~26GB | −3.7 pts | comparable | tight VRAM budgets |
| Q2    | 2  | ~18GB | heavy loss | slightly faster | last resort |

VRAM & Performance Impact

VRAM calculation: weights ≈ parameter count × bytes per parameter (bits ÷ 8).

Llama 3 70B:

- FP32: 70B × 4 bytes = 280GB (impractical)

- Q8: 70B × 1 byte = 70GB (multi-GPU or datacenter territory)

- Q4: 70B × 0.5 bytes = 35GB (fits two 24GB cards or one 48GB card, plus headroom for KV cache)
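The formula above is easy to script. A minimal sketch (weights only; KV cache and runtime overhead are ignored, and the helper name is my own):

```python
def weight_vram_gb(params_billions: float, bits: int) -> float:
    """Approximate VRAM (GB) needed for model weights alone."""
    bytes_per_param = bits / 8
    return params_billions * bytes_per_param  # 1B params at 1 byte each ≈ 1 GB

# Llama 3 70B at each level:
for name, bits in [("FP32", 32), ("Q8", 8), ("Q5", 5), ("Q4", 4)]:
    print(f"70B {name}: ~{weight_vram_gb(70, bits):.0f} GB")
```

Budget an extra 10–20% on top of these figures for the KV cache and runtime buffers.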

Speed: all quantization levels are memory-bound (waiting on VRAM bandwidth), not compute-bound.

Tokens/sec is therefore similar from Q2 through Q8 on the same hardware: lower-bit quants stream fewer bytes per token, but dequantization overhead absorbs much of that saving.

VRAM bandwidth, not computation, is the bottleneck. Quantization's main win is VRAM; any speed gain is modest.
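The memory-bound claim yields a useful back-of-envelope bound: each generated token must stream roughly all weight bytes once, so tokens/sec is capped at bandwidth ÷ model size. A sketch, using the RTX 4090's published ~1008 GB/s memory bandwidth (an idealized ceiling, not a benchmark):

```python
def max_tokens_per_sec(model_gb: float, bandwidth_gb_s: float) -> float:
    """Idealized upper bound: every token streams all weights once."""
    return bandwidth_gb_s / model_gb

RTX_4090_BW = 1008  # GB/s, published memory bandwidth spec

# A 7B model: Q8 weights = ~7 GB, Q4 weights = ~3.5 GB
print(max_tokens_per_sec(7.0, RTX_4090_BW))   # Q8 ceiling: 144 tok/s
print(max_tokens_per_sec(3.5, RTX_4090_BW))   # Q4 ceiling: 288 tok/s
```

Real throughput lands well below these ceilings, but the ratio explains why quantization never makes inference slower.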

Quality Loss: Objective Benchmarks

Measured on MMLU benchmark (general knowledge, 57 tasks):

  • Llama 3 70B FP32 baseline: 85.2% accuracy.
  • Llama 3 70B Q8: 85.1% accuracy (−0.1 points).
  • Llama 3 70B Q5: 84.7% accuracy (−0.5 points).
  • Llama 3 70B Q4: 84.0% accuracy (−1.2 points).
  • Llama 3 70B Q3: 81.5% accuracy (−3.7 points).
  • Real-world impact: Q4 vs Q8 ≈ 1 fewer correct answer per 100 questions.
  • For chat/writing the difference is imperceptible. For STEM problems, Q8 is safer.

When to Use Each Level

Q4: Default. Use for all models. Sweet spot of compression + quality.

Q5: Rarely. If you need Q5-level quality, Q4 on a slightly larger model usually beats it. And if you have the ~44GB a 70B Q5 needs, 70B Q4 (~35GB) already fits with headroom to spare.

Q8: Only if you have 32GB+ VRAM AND model is <70B AND you need perfect accuracy (research, medical use).

Q3: Budget squeeze. 3% quality loss acceptable? Use Q3. Otherwise, upgrade GPU or use smaller model.

Q2: Desperation. Quality loss too high for most. Use only if OOM on Q3.
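The rules above can be folded into a single hypothetical helper. The function name, thresholds, and ~15% overhead factor are my own assumptions, not any library's API:

```python
def pick_quant(vram_gb: float, params_b: float, need_max_accuracy: bool = False) -> str:
    """Pick a quantization level per the rules above (weights-only estimate)."""
    def fits(bits: int) -> bool:
        # ~15% headroom assumed for KV cache and runtime overhead
        return params_b * bits / 8 * 1.15 <= vram_gb

    if need_max_accuracy and fits(8):
        return "Q8"   # pristine quality, VRAM to spare
    if fits(4):
        return "Q4"   # the default
    if fits(3):
        return "Q3"   # budget squeeze: ~3% quality loss
    if fits(2):
        return "Q2"   # desperation
    return "model too large: use a smaller model"

print(pick_quant(24, 13))        # 13B on a 24GB card → "Q4"
print(pick_quant(24, 7, True))   # 7B, accuracy-critical → "Q8"
```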

Q4 Deep Dive: Why It's the Standard

Q4 is optimal because:

1. 87.5% VRAM savings (best ratio).

2. ~1.2% quality loss on MMLU (imperceptible to most users).

3. No speed penalty (memory-bound, not compute-bound).

4. Fits accessible hardware (70B on two 24GB cards or one 48GB card; 7B–13B on a single RTX 4090).

5. Industry standard (HuggingFace, Ollama default to Q4).

Every model released post-2024 includes a Q4 variant for production use.

If a model only has FP32/Q8/Q5, the project is not production-ready.

Common Misconceptions

  • Q4 sounds "low quality" because 4-bit seems small. False. 1% quality loss is imperceptible.
  • Quantization makes inference slower. False. Speed is comparable (memory-bound, not compute-bound); lower-bit quants can even be marginally faster.
  • I should use Q8 to be safe. False. Q4 is proven, safe, and standard. Q8 is wasteful.
  • I need FP32 for accuracy. False. Never true. Q8 is sufficient even for research.

FAQ

Should I use Q4 or Q8 for coding?

Q4. Speed is identical, quality difference is 1%, which is imperceptible for code generation.

Can I use Q3 if I'm tight on VRAM?

Yes. 3% quality loss is acceptable for chat/creative writing. Unacceptable for reasoning/math.

Is there a Q6 or Q7?

Not as widely published. llama.cpp's GGUF format includes a popular Q6_K type, and some projects implement other custom levels, but Q4/Q5/Q8 remain the most common.

Which quantization is fastest?

All roughly the same (memory-bound). Q2 can be marginally faster because it moves fewer bytes per token, but the difference is typically under 5%.

Can I dequantize Q4 back to FP32?

No, the rounded-off precision is gone. You can cast Q4 weights back to FP32 numerically, but the original values cannot be recovered. Quantization is one-way.
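The one-way loss is easy to demonstrate with a toy symmetric 4-bit scheme (a simplification: real formats such as GGUF's K-quants use per-block scales, but the information loss is the same in kind):

```python
def quantize_4bit(values: list[float]) -> tuple[list[int], float]:
    """Toy symmetric 4-bit quantization: map floats to integers in [-8, 7]."""
    scale = max(abs(v) for v in values) / 7
    return [max(-8, min(7, round(v / scale))) for v in values], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Cast back to floats: close to the originals, but not equal to them."""
    return [x * scale for x in q]

weights = [0.123, -0.456, 0.789, -0.001]
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)
# The rounding error per weight is small but unrecoverable:
print([round(w - r, 4) for w, r in zip(weights, restored)])
```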

Should I quantize my fine-tuned model?

Yes, after training. Quantize the trained weights to Q4 for deployment.

Sources

  • MMLU benchmark: quantization impact on reasoning tasks (OpenAI Evals)
  • Llama model card: accuracy across quantization levels
  • Quantization research: "Towards Quantization-Aware Deep Neural Networks" (arXiv 2024)

Compare your local LLM side by side with 25+ cloud models using PromptQuorum.

Try PromptQuorum free →
