LLM Quantization: Q4 vs Q5 vs Q8 Explained (When to Use Each)

14 min read · By Hans Kuepper · Founder of PromptQuorum, multi-model AI dispatch tool

Choose quantization based on VRAM: 6–8 GB VRAM → use Q4_K_M (~4.5 GB for 7B models, 1–3% quality loss), 16 GB → Q5_K_M, 24+ GB → Q8_0 (negligible loss). Quantization reduces model weight precision from 16-bit floats to 4- or 8-bit integers, cutting RAM by 50–75%. For models larger than your GPU, add CPU offloading or multi-GPU layer splitting.

Complete guide to choosing the right LLM quantization for your hardware: Q4_K_M for 6–8 GB VRAM, Q5_K_M for 16 GB, Q8_0 for 24+ GB. Includes GGUF format explained, quality loss breakdown by quantization level, and advanced techniques (CPU offloading and multi-GPU layer splitting). Learn how to run Llama 3.3 70B on RTX 4090 via offloading, 2× RTX 4090 via layer splitting, or Mac Studio M2 Ultra natively. Updated May 2026.

Key Takeaways

  • Quantization converts 16-bit model weights to 4-bit or 8-bit, reducing RAM by 50-75%.
  • Q4_K_M is the standard recommended level -- best balance of quality and RAM for consumer hardware.
  • A 7B model at FP16 = ~14 GB RAM. At Q4_K_M = ~4.5 GB. At Q8_0 = ~7.7 GB.
  • Quality loss at Q4_K_M is 1-3% on MMLU benchmarks compared to FP16 -- imperceptible in most practical tasks.
  • GGUF is the file format that stores quantized models for llama.cpp, Ollama, and LM Studio.

What Is LLM Quantization and Why Does It Matter?

Quantization converts 16-bit model weights (FP16) to 4-bit or 8-bit integers, reducing RAM by 50-75% with only 1-3% quality loss at Q4_K_M. A large language model stores its learned knowledge as billions of numerical weights. By default, these are stored as 16-bit floating-point numbers (FP16) -- two bytes per weight. A 7B model has 7 billion weights, so the FP16 file size is approximately 14 GB.

Quantization replaces these 16-bit floats with lower-precision integers. At 4-bit quantization, each weight uses 0.5 bytes instead of 2 -- cutting memory to ~3.5 GB for the weights alone. Q4_K_M also stores per-block scale factors and keeps some tensors at higher precision, so a quantized 7B model at Q4_K_M is approximately 4.5 GB.

This matters for local inference because consumer hardware has limited RAM. Without quantization, a 7B model requires 16 GB of RAM to run. With Q4_K_M quantization, the same model runs on 6 GB of RAM, making it accessible on most modern laptops.
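The arithmetic above can be sketched in a few lines of Python. This is a minimal estimate, not a measurement: the function names are hypothetical, and it counts only the raw weight bits, so real GGUF files come out somewhat larger (block scales, higher-precision tensors, metadata).

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Raw weight size in GB (1 GB = 1e9 bytes), ignoring all overhead."""
    return params_billion * bits_per_weight / 8


def savings_percent(bits_from: float, bits_to: float) -> float:
    """Memory saved when going from one bit width to another."""
    return (1 - bits_to / bits_from) * 100


if __name__ == "__main__":
    for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
        size = weight_size_gb(7, bits)
        saved = savings_percent(16, bits)
        print(f"7B at {label}: {size:.1f} GB raw weights, {saved:.0f}% saved vs FP16")
```

The 50% and 75% figures fall straight out of the bit ratio (16 to 8, and 16 to 4); the gap between the raw 3.5 GB and the ~4.5 GB quoted for Q4_K_M is the overhead this sketch deliberately leaves out.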

What is Q4_K_M Quantization?

Q4_K_M is a 4-bit GGUF quantization format used in llama.cpp and Ollama. The "K" means it uses K-quants (mixed precision), and the "M" means medium -- a balance between model size, speed, and quality loss. Q4_K_M stores most weights at 4-bit but uses 6-bit for the most sensitive tensors, giving it a better quality-to-size ratio than pure 4-bit Q4_0.

  • Q4_K_M uses ~4.5 GB RAM for a 7B model -- roughly 70% less than FP16 -- with only 1–3% quality loss
  • K-quants apply different precision to different weight groups based on sensitivity (important weights get more bits)
  • The "M" variant is the standard recommended version (lighter "S" and heavier "L" variants also exist)
  • Q4_K_M is the default choice for consumer hardware with 6–16 GB VRAM
  • Works with Ollama (for example `ollama run llama3.1:8b-instruct-q4_K_M`), LM Studio, and llama.cpp

How Do Q4_K_M, Q5_K_M, Q8_0, and Other Levels Differ?

Q4_K_M at 4-bit is the standard recommendation -- approximately 4.5 GB RAM for a 7B model with only 1-3% quality loss vs FP16. Quantization names follow a pattern: Q{bits}_{variant}. The bit count is the weight precision; the variant affects how the quantization is applied:

| Level | Bits | RAM (7B) | Quality Loss | Use When |
|---|---|---|---|---|
| Q2_K | 2 | ~2.7 GB | High | RAM < 4 GB, accept quality degradation |
| Q3_K_S | 3 | ~3.3 GB | Moderate | RAM 4-5 GB |
| Q4_K_M | 4 | ~4.5 GB | Low (1-3%) | Default for most users |
| Q5_K_M | 5 | ~5.7 GB | Minimal (<1%) | 16 GB RAM, want better quality |
| Q6_K | 6 | ~6.6 GB | Near-lossless | 16 GB RAM, coding/math tasks |
| Q8_0 | 8 | ~7.7 GB | Negligible | 16+ GB RAM, maximum quality |

Quantization levels compared: from Q2_K (highest compression) to Q8_0 (highest quality). Q4_K_M is the recommended standard for most users.

What Is GGUF Format and How Does It Relate to Quantization?

GGUF (GPT-Generated Unified Format) is the single-file standard for quantized LLM weights, containing model weights, metadata, and tokenizer -- used by Ollama, LM Studio, and llama.cpp. It was created by the llama.cpp project and replaces the older GGML format.

A GGUF file contains: the quantized model weights, all model metadata (architecture, tokenizer, context length), and a format version number. This self-contained design means a single `.gguf` file is everything needed to run the model -- no separate tokenizer files, no configuration JSON.

As of April 2026, GGUF is the standard format for Ollama, LM Studio, Jan AI, and GPT4All. When you run `ollama pull llama3.1:8b`, Ollama downloads a GGUF file internally. When LM Studio shows model file sizes, those are GGUF file sizes.

The quantization level is part of the filename: `Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf` is a Q4_K_M quantized GGUF of Llama 3.1 8B.
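Because the level is encoded in the filename, it can be read back programmatically. The following is a small sketch under the assumption that the file follows the common `<name>-Q<bits>_<variant>.gguf` convention shown above; `parse_quant` is a hypothetical helper, not part of llama.cpp or Ollama, and files named differently will simply return `None`.

```python
import re
from typing import Optional, Tuple

# Matches a trailing "-Q4_K_M.gguf" style suffix (case-insensitive).
_QUANT_SUFFIX = re.compile(r"[-.](Q(\d+)_([A-Za-z0-9_]+))\.gguf$", re.IGNORECASE)


def parse_quant(filename: str) -> Optional[Tuple[str, int, str]]:
    """Return (level, bits, variant) parsed from a GGUF filename, or None."""
    match = _QUANT_SUFFIX.search(filename)
    if match is None:
        return None
    level, bits, variant = match.groups()
    return level.upper(), int(bits), variant.upper()


if __name__ == "__main__":
    print(parse_quant("Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"))
    print(parse_quant("README.md"))
```

The bit count returned here is the nominal one from the name; as noted earlier, K-quants mix precisions, so the effective bits per weight are slightly higher.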

GGUF format contains quantized weights, model metadata (tokenizer, context length), and format version in a single self-contained file.

How Much RAM Does Quantization Save for Different Model Sizes?

| Model Size | FP16 | Q8_0 | Q4_K_M | Q3_K_S |
|---|---|---|---|---|
| 3B | ~6 GB | ~3.8 GB | ~2 GB | ~1.6 GB |
| 7B | ~14 GB | ~7.7 GB | ~4.5 GB | ~3.3 GB |
| 13B | ~26 GB | ~14 GB | ~8.5 GB | ~6 GB |
| 34B | ~68 GB | ~36 GB | ~22 GB | ~16 GB |
| 70B | ~140 GB | ~70 GB | ~40 GB | ~30 GB |

RAM savings across model sizes: 3B through 70B models at FP16, Q8_0, Q4_K_M, and Q3_K_S quantization levels.

How Much Quality Do You Actually Lose with Quantization?

Q4_K_M loses 1-3% on MMLU benchmarks vs FP16 -- imperceptible in most practical tasks. Q3_K_S loses 5-10% and is noticeable on math and reasoning. Quality loss from quantization is measured by comparing benchmark scores between full-precision and quantized versions.

Quantization reduces memory usage but can degrade output quality. Well-engineered prompts can partly compensate: techniques like few-shot examples and explicit output constraints help quantized models maintain accuracy. See prompt engineering techniques for methods that work at any quantization level. As of April 2026, the established findings are:

  • Q4_K_M vs FP16: 1-3% degradation on MMLU. On a 7B model scoring 73% at FP16, Q4_K_M scores 71-72%. In practical tasks, this difference is imperceptible.
  • Q3_K_S vs FP16: 5-10% degradation. Noticeable on complex reasoning and math tasks. A model that correctly solves a math problem at FP16 may fail at Q3_K_S.
  • Q2_K vs FP16: 15-25% degradation. Significant quality loss across all task types. Only use when RAM constraint is absolute.
  • Q8_0 vs FP16: under 0.5% degradation -- essentially identical for all practical purposes.
  • The K_M variants (K-Quant Medium) use a mixed-precision approach that preserves quality better than older Q4_0 quantization at the same bit count. Always prefer Q4_K_M over Q4_0 when both are available.
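To see how the percentages translate into scores, here is a minimal sketch. It assumes the degradation figures are relative to the FP16 score (the 73% example above is consistent with that reading, but the article does not state it explicitly); the function name is hypothetical.

```python
from typing import Tuple


def quantized_score_range(
    fp16_score: float, loss_low_pct: float, loss_high_pct: float
) -> Tuple[float, float]:
    """Return (worst, best) expected score given a relative loss range in percent."""
    best = fp16_score * (1 - loss_low_pct / 100)
    worst = fp16_score * (1 - loss_high_pct / 100)
    return worst, best


if __name__ == "__main__":
    # Assumed loss ranges taken from the list above.
    for level, (low, high) in {"Q4_K_M": (1, 3), "Q3_K_S": (5, 10), "Q2_K": (15, 25)}.items():
        worst, best = quantized_score_range(73.0, low, high)
        print(f"{level}: roughly {worst:.1f} to {best:.1f} on a 73.0 FP16 baseline")
```

For Q4_K_M this gives roughly 70.8 to 72.3, which rounds to the 71-72% quoted above.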

Which Quantization Should You Use? (Quick Decision Tree)

Choose based on your available VRAM, not on the model size alone. The table below shows which quantization to select for different hardware constraints.

  • For 6 GB RAM (most common laptop/desktop): Use Q4_K_M. A 7B model quantized to Q4_K_M is ~4.5 GB, leaving 1.5 GB for the OS and browser.
  • For coding or math tasks: Use Q5_K_M or higher even if you have budget for Q4_K_M. Quantization effects (1–3% loss) are most visible on precise numerical reasoning. For an end-to-end air-gapped coding setup that pairs Q5_K_M Qwen3-Coder with no-internet operation, see Local Coding LLM Without Internet.
  • Quantization + Temperature trade-off: A Q4_K_M model at temperature 0.3 produces more deterministic output than a full-precision (FP16) model at temperature 1.0. For independent tuning, see temperature and top-p: control AI creativity.

| Your VRAM | Best Quantization | Model Size | Quality |
|---|---|---|---|
| 4–6 GB | Q3_K_S or Q4_K_M | 3B, 7B (Q4) or 7B (Q3) | 5–10% loss (Q3) or 1–3% (Q4) |
| 6–8 GB | Q4_K_M (recommended) | 7B native | 1–3% loss (imperceptible) |
| 12–16 GB | Q5_K_M | 7B, 13B native | <1% loss (minimal) |
| 24 GB (RTX 4090) | Q5_K_M or Q6_K | 13B, 32B native; Q4 + offload for 70B | Negligible <0.5% |
| 32 GB (RTX 5090) | Q5_K_M, Q6_K, or Q8_0 | 70B at Q4 (35 GB) or Q5 (43 GB), with partial offload | 0–2% loss |
| 48+ GB (2× RTX 4090) | Q5_K_M or Q8_0 | 70B native with layer splitting | Negligible <0.5% |
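The table can be expressed as a simple lookup. This sketch is only an encoding of the rows above: the function name is hypothetical, gaps between rows (for example 8–12 GB) are assumed to fall back to the next lower tier, and the sub-4 GB fallback comes from the Q2_K row of the levels table earlier in the article.

```python
# (minimum VRAM in GB, recommendation), checked from largest to smallest.
VRAM_TIERS = [
    (48, "Q5_K_M or Q8_0"),
    (32, "Q5_K_M, Q6_K, or Q8_0"),
    (24, "Q5_K_M or Q6_K"),
    (12, "Q5_K_M"),
    (6, "Q4_K_M"),
    (4, "Q3_K_S or Q4_K_M"),
]


def recommend_quantization(vram_gb: float) -> str:
    """Return the table's recommended quantization for a given amount of VRAM."""
    for min_vram, recommendation in VRAM_TIERS:
        if vram_gb >= min_vram:
            return recommendation
    return "Q2_K (expect significant quality loss)"


if __name__ == "__main__":
    for vram in (4, 8, 16, 24, 48):
        print(f"{vram} GB -> {recommend_quantization(vram)}")
```

Treat the output as a starting point: the model size you want to run matters as much as the VRAM figure.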

LM Studio: How to Select Quantization in the UI

LM Studio (desktop app) shows available quantization variants for each model download. When searching for a model, you'll see multiple GGUF options: Q2_K, Q3_K_S, Q4_K_M, Q5_K_M, Q6_K, Q8_0.

Step 1: Open LM Studio → Navigate to the "Local Models" tab. Search for a model (e.g., "Llama 3.1 8B").

Step 2: Each model shows available quantizations. Look at the file size to estimate VRAM usage. Q4_K_M for a 7B model is usually listed as ~4.5 GB.

Step 3: Click the download icon next to your chosen quantization.

Recommended defaults for LM Studio:

- If your GPU has 6-8 GB VRAM (RTX 4060, RTX 3060 Ti, RTX 4060 Ti): Download the Q4_K_M variant (smallest file with acceptable quality).

- If your GPU has 12-16 GB VRAM (RTX 4070, RTX 4080): Download Q5_K_M or Q6_K (better quality, still well within VRAM).

- If your GPU has 24+ GB VRAM (RTX 4090, RTX 5090): Download Q8_0 or FP16 (maximum quality, minimal speed penalty).

LM Studio's "GPU offload" feature: Check the "Use GPU" toggle in the chat interface. LM Studio will automatically move as many model layers to GPU as VRAM allows, offloading the rest to CPU RAM. If your system RAM is sufficient, this allows running models larger than your GPU VRAM (e.g., Llama 3.3 70B Q4_K_M on RTX 4090 with 64+ GB system RAM).

Offloading: CPU RAM as Spillover

When VRAM is full, models can offload (move) layers to system RAM. Offloading trades speed for capacity.

Scenario: Running a 70B Q4 model on an RTX 4090 (24 GB). The model needs 35 GB, so roughly 11 GB (about a third of the weights) must sit in system RAM. With offloading, it runs at ~5-10 tokens/sec.

Offloading is a last resort -- it makes interactive inference impractical. Use it only for offline batch processing or experimentation.

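```bash
# Ollama: layers that do not fit in VRAM are placed in system RAM automatically
ollama run llama3.3:70b

# Ollama: to force CPU-only inference, set num_gpu (layers kept on the GPU) to 0
# inside the interactive session:
#   /set parameter num_gpu 0

# vLLM: enable CPU offload (partial)
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --gpu-memory-utilization 0.7 \
  --cpu-offload-gb 10  # Offload 10 GB to RAM
```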
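How much of a model ends up in system RAM can be estimated before loading it. The sketch below is a rough planning aid under stated assumptions: all layers are treated as equal in size, embeddings and the KV cache are ignored, and the 80-layer figure for a 70B Llama-style model is an assumption supplied for the example. The function name is hypothetical.

```python
import math
from typing import Tuple


def split_layers(
    model_gb: float, n_layers: int, vram_gb: float, reserve_gb: float = 0.0
) -> Tuple[int, int, float]:
    """Return (gpu_layers, cpu_layers, gb_in_system_ram) for a simple layer split."""
    per_layer_gb = model_gb / n_layers
    usable_vram = max(vram_gb - reserve_gb, 0.0)
    gpu_layers = min(n_layers, math.floor(usable_vram / per_layer_gb))
    cpu_layers = n_layers - gpu_layers
    return gpu_layers, cpu_layers, cpu_layers * per_layer_gb


if __name__ == "__main__":
    # 70B at Q4 (35 GB) on a 24 GB GPU, assuming 80 equal-sized layers.
    gpu, cpu, spill = split_layers(model_gb=35, n_layers=80, vram_gb=24)
    print(f"{gpu} layers on GPU, {cpu} layers in system RAM (~{spill:.1f} GB)")
```

In practice some VRAM has to be kept free for the KV cache and runtime buffers, which is what the `reserve_gb` parameter is for; setting it to a few GB pushes more layers to system RAM.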

Layer Splitting: Distribute Across Multiple GPUs

Modern inference engines (vLLM, llama.cpp) can split a model across multiple GPUs automatically. Learn more about Multi-GPU Local LLMs for advanced setups.

Example: 70B model with 2× RTX 4090:

- Without splitting: Impossible (needs 40+ GB VRAM in one GPU).

- With splitting: Half the model weights on each GPU. Inference speed: ~100 tokens/sec (communication overhead is minimal).

Layer splitting is practical for production deployments and is transparent to the user.

```bash
# vLLM: automatic tensor parallelism
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 2  # Split across 2 GPUs

# Ollama (built on llama.cpp): detects multiple GPUs and splits layers across them
ollama run llama3.3:70b
```

KV Cache Quantization: Reducing Context Memory Overhead

KV Cache quantization reduces the memory required to store attention key-value pairs during inference, particularly important when processing long contexts (32K+ tokens). While model weight quantization (Q4_K_M) is most common, KV cache quantization targets a different memory bottleneck.

During inference, the model maintains running key-value (KV) pairs for each token in the context. For a 7B model processing a 32K-token context, KV cache alone can consume 8–16 GB of VRAM depending on the precision. Standard KV cache uses FP16 (2 bytes per value); quantizing the KV cache to FP8 or Q8 reduces this by 50%.
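The KV cache size follows directly from the model shape. The sketch below is an estimate under assumptions: the shape values (32 layers, 32 KV heads, head dimension 128) describe an older 7B-class architecture without grouped-query attention and are supplied for illustration, and 8-bit storage is simplified to exactly 1 byte per value. The function name is hypothetical.

```python
def kv_cache_gib(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: float = 2.0,
) -> float:
    """KV cache size in GiB: one K and one V tensor per layer, per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 2**30


if __name__ == "__main__":
    shape = dict(n_layers=32, n_kv_heads=32, head_dim=128, context_tokens=32 * 1024)
    print(f"FP16 cache:  {kv_cache_gib(**shape):.1f} GiB")
    print(f"8-bit cache: {kv_cache_gib(**shape, bytes_per_value=1.0):.1f} GiB")
```

Models that use grouped-query attention have fewer KV heads, so their cache is proportionally smaller for the same context length; lowering `n_kv_heads` to 8 in the example cuts the FP16 figure to a quarter.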

How to enable KV Cache quantization:

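- Ollama: Set the `OLLAMA_KV_CACHE_TYPE` environment variable (for example `q8_0`) before starting the server; this requires flash attention (`OLLAMA_FLASH_ATTENTION=1`). The default cache type is `f16`.

- LM Studio: Check "KV cache quantization" toggle in Settings (if available on your version).

- llama.cpp: Use the `--cache-type-k q8_0` and `--cache-type-v q8_0` flags when starting the server.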

Trade-offs: KV cache quantization has minimal quality impact (<1% degradation even with aggressive quantization) because attention patterns are more robust to lower precision than model weights. Recommended for models processing 16K+ contexts on constrained hardware.

Hybrid Approach: Combining Techniques

Best results come from combining all three techniques. See VRAM Requirements Guide for specific hardware planning.

Scenario 1: 70B on single RTX 4090 (24 GB)

- Quantize to Q4 (140 GB → 35 GB)

- Use offloading for the remaining ~11 GB (to system RAM)

- Result: ~5-10 tokens/sec (slow but works)

Scenario 2: 70B on 2× RTX 4090

- Quantize to Q5 (43.75 GB)

- Use layer splitting across 2 GPUs (22 GB each)

- Result: ~100 tokens/sec (practical)

What Are the Performance Trade-offs?

Each technique trades VRAM reduction for speed penalties. Quantization has minimal impact; offloading causes 5–10× slowdown; layer splitting adds 5–10% overhead.

| Technique | VRAM Saved | Speed Impact | Quality Impact |
|---|---|---|---|
| Quantization (Q4) | ~70% | None (±5%) | Minor |
| Offloading (CPU RAM) | 60-80% | 5-10× slower | None |
| Layer splitting (2 GPUs) | N/A (enables larger models) | 5-10% slower | None |
| Quantization + Offloading | 75-90% | 3-5× slower | Minor |

Mac Studio M2 Ultra: Native 70B Without Offloading

Mac Studio M2 Ultra with 192 GB unified memory runs Llama 3.3 70B at Q4 natively -- no offloading, no layer splitting required.

Unified memory bandwidth: Mac Studio M2 Ultra accesses both CPU and GPU memory at ~800 GB/s. DDR5 system RAM offloading is capped at ~90 GB/s. This 9× advantage eliminates the speed penalty that makes offloading impractical.

| Setup | Model | Speed | Complexity |
|---|---|---|---|
| 1× RTX 4090 + offloading | Llama 3.3 70B Q4 | 5–10 tok/sec | Medium |
| 2× RTX 4090 layer split | Llama 3.3 70B Q5 | ~100 tok/sec | High |
| 1× RTX 5090 (32 GB) | Llama 3.3 70B Q4 | 10–12 tok/sec | Low |
| Mac Studio M2 Ultra | Llama 3.3 70B Q4 | 35 tok/sec | Low (plug & play) |

LLM Quantization: Regional Context

  • EU (GDPR, Article 44) -- Cross-border AI data transfers require adequacy decisions or Standard Contractual Clauses. Q4_K_M quantization enables 7B models to run on 8 GB edge devices, eliminating third-party cloud API calls entirely. The German BfDI and French CNIL both recommend local inference for high-risk AI processing under GDPR Article 22. Quantized Mistral and Llama models are the dominant choices in EU enterprise deployments for this reason.
  • Japan (METI AI Governance Guidelines 2024) -- Japan's Ministry of Economy, Trade and Industry requires AI governance documentation for enterprise deployments. Quantized models on domestic infrastructure satisfy METI's "controllability" requirements -- the model weights stay on-premises. Q4_K_M quantization makes 13B-32B models feasible on 16-32 GB corporate servers without GPU clusters. Qwen2.5 and Llama 3 are the most-deployed families in Japanese enterprise settings.
  • China (CAC Generative AI Regulations 2023) -- China's Cyberspace Administration requires security assessments for publicly deployed AI and data localization for user data. Quantized Chinese-native models (Qwen2.5, Baichuan2, Yi) run entirely on domestic hardware, satisfying CAC localization requirements. Q4_K_M and Q5_K_M quantization reduce hardware costs by 60-70% versus FP16, making on-premises CAC compliance economically viable for mid-sized enterprises.

What Are the Common Mistakes with LLM Quantization?

  • Downloading Q4_0 instead of Q4_K_M -- Q4_0 is an older quantization method without K-Quant improvements. Q4_K_M is 5-8% better quality at the same RAM footprint. When both are available, always choose Q4_K_M.
  • Assuming a higher Q number means worse quality -- The number after Q is the bit count, so a higher number means more bits and better quality: Q8_0 is better than Q4_K_M, and Q5_K_M is better than Q4_K_M. Parameter count still matters more than bit depth: a Q4_K_M 70B model will outperform a Q8_0 7B model on most tasks.
  • Not checking RAM headroom before loading a model -- The model size is not the only RAM consumer. OS, browser, and other applications use RAM too. On an 8 GB machine, a 4.5 GB Q4_K_M 7B model leaves only 3.5 GB for everything else. Rule: model file size + 2 GB OS overhead + 1 GB headroom = minimum required RAM.
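The headroom rule from the last item is easy to check before downloading a model. This is a minimal sketch of that rule only; the function names are hypothetical, and the 2 GB and 1 GB defaults are the article's rule of thumb rather than measured values for any particular system.

```python
def minimum_ram_gb(
    model_file_gb: float, os_overhead_gb: float = 2.0, headroom_gb: float = 1.0
) -> float:
    """Minimum RAM per the rule: model file + OS overhead + headroom."""
    return model_file_gb + os_overhead_gb + headroom_gb


def fits_in_ram(model_file_gb: float, installed_ram_gb: float) -> bool:
    """True if the installed RAM meets the rule-of-thumb minimum."""
    return minimum_ram_gb(model_file_gb) <= installed_ram_gb


if __name__ == "__main__":
    for name, size_gb in [("7B Q4_K_M", 4.5), ("7B Q8_0", 7.7)]:
        verdict = "fits" if fits_in_ram(size_gb, 8) else "does not fit"
        print(f"{name} ({size_gb} GB) needs {minimum_ram_gb(size_gb):.1f} GB: {verdict} in 8 GB")
```

Long contexts add KV cache memory on top of this, so treat the result as a lower bound.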

Common Questions About LLM Quantization

Does Ollama automatically use the best quantization?

Ollama picks a sensible default rather than the best level for your hardware -- when you run `ollama pull llama3.1:8b`, it downloads the Q4_K_M variant. To pull a specific quantization, append the tag: `ollama pull llama3.1:8b-instruct-q5_K_M`. Available quantization tags for each model are listed on the model's page at ollama.com/library.

Can I quantize a model myself instead of downloading a pre-quantized version?

Yes -- llama.cpp includes a `llama-quantize` binary (named `quantize` in older builds) that converts GGUF files to any supported quantization level. The process takes 5-30 minutes depending on model size. Most users should download pre-quantized GGUF files from Hugging Face rather than quantizing themselves, as the results are equivalent.

Does quantization affect the model's context window?

No -- quantization only affects model weight precision, not the context length. A Llama 3.1 8B model supports 128K tokens whether quantized to Q4_K_M or run at FP16. However, processing longer contexts requires more RAM regardless of quantization -- processing a 64K token context with a Q4_K_M 7B model may require 10+ GB RAM.

What is the difference between GGUF and GPTQ quantization?

GGUF (llama.cpp format) and GPTQ are two different quantization approaches. GGUF uses K-Quants and runs on CPU and GPU. GPTQ is GPU-only and requires PyTorch. For local inference with Ollama, LM Studio, or Jan AI, GGUF is the correct format. GPTQ is used with GPU-focused inference frameworks like AutoGPTQ and vLLM.

Is there a quality difference between Q4_K_M models from different providers on Hugging Face?

The quantization algorithm is standardized in llama.cpp, so Q4_K_M quantizations of the same base model should be nearly identical regardless of who created the GGUF file. However, some providers apply additional adjustments (imatrix quantization) that improve quality. Files described as "imat" or "importance matrix" quantized are generally higher quality at the same bit count.

What is the difference between Q4_K_M and Q4_0?

Q4_K_M and Q4_0 are both 4-bit quantization, but they use different algorithms. Q4_0 is the original uniform 4-bit format from early llama.cpp. Q4_K_M is a K-Quant introduced in 2023 -- it groups weights into blocks and applies mixed precision within each block, recovering 5-8% quality at the same RAM footprint. When you see both on Hugging Face, always choose Q4_K_M. Q4_0 only exists for legacy compatibility.

What is imatrix quantization?

Imatrix (importance matrix) quantization uses calibration data to estimate how much each weight contributes to model output, then chooses quantized values that minimize error on the weights that matter most. The bit count stays the same; only the rounding is better informed. Result: better quality at the same bit count compared to quantization without calibration data. Qwen2.5 imatrix quantizations are 2-4% better than standard Q4_K_M.

What's the difference between Q4_K_M and Q4_K_S?

Both are 4-bit K-quants, but K_M (Medium) and K_S (Small) differ in how many tensors are kept at higher precision. Q4_K_M keeps more of the sensitive tensors at higher precision for better quality -- typically 4.5-5 GB for a 7B model. Q4_K_S is more aggressive -- it saves 300-400 MB compared to K_M but with 3-5% quality loss. Use Q4_K_M unless you're on extremely constrained hardware (< 4 GB RAM).

Can I switch between quantization levels without redownloading the model?

No -- switching quantization levels requires downloading a different GGUF file or re-quantizing the base model yourself. Once a model is quantized to Q4_K_M, you cannot convert it back to Q5_K_M without the original FP16 model. Most users download pre-quantized GGUF files from Hugging Face for their desired quantization level.

How does quantization affect inference speed?

Quantization typically increases inference speed by 10-40% because 4-bit weights are a quarter the size of 16-bit floats and token generation is limited by how fast weights can be read from memory. The gain is largest on CPU: a Q4_K_M 7B model runs at ~8-12 tok/s on a consumer CPU; the same model at FP16 runs at ~1-2 tok/s. GPU performance gain from quantization is smaller (5-15% faster) because GPUs are already optimized for float arithmetic.

What quantization level does Ollama use by default?

Ollama defaults to Q4_K_M for most models in its library. When you run `ollama pull llama3.1:8b`, you're downloading the Q4_K_M variant. This default balances quality and RAM requirements well for most users. To pull a different quantization, append it to the tag: `ollama pull llama3.1:8b-instruct-q5_K_M` or `ollama pull llama3.1:8b-instruct-q8_0`.

Can I run Llama 3.3 70B on a single RTX 4090?

Yes, but slowly. Quantize to Q4 (35 GB), offload 11 GB to system RAM. Expect 5-10 tok/sec -- too slow for real-time chat, fine for batch processing. For practical 70B inference: 2× RTX 4090 with layer splitting (~100 tok/sec) or Mac Studio M2 Ultra (35 tok/sec native).

What is the difference between quantization and offloading?

Quantization reduces model weight precision permanently (FP16 → Q4), shrinking the model file. Offloading moves model layers from VRAM to system RAM at runtime. Quantization has minimal quality impact (1–3% at Q4_K_M) and little effect on speed (±5%); offloading causes 5–10× speed degradation. Use quantization first, offloading as last resort.

Does Mac Studio M2 Ultra need quantization for 70B models?

Only mild quantization. 192 GB unified memory holds Llama 3.3 70B at Q4 (35 GB) natively -- no offloading or layer splitting. At Q5, 70B still fits (44 GB). FP16 70B (140 GB) also fits but runs slower. Q4 is the sweet spot for Mac Studio 70B workflows.

Which technique combination is best for my hardware?

Single RTX 4090 (24 GB): Q4 + offloading for 70B (slow). Q5 native for 32B (fast). 2× RTX 4090 (48 GB): Q5 + layer splitting for 70B (100 tok/sec). RTX 5090 (32 GB): Q4 for 70B with a small offload, since the 35 GB model exceeds 32 GB of VRAM (10-12 tok/sec). Mac Studio M2 Ultra (192 GB): Q4 native for 70B (35 tok/sec).

Update Log

  • 2026-05-17: Updated title to reflect decision-focused intent; content unchanged.

A Note on Third-Party Facts

This article references third-party AI models, benchmarks, prices, and licenses. The AI landscape changes rapidly. Benchmark scores, license terms, model names, and API prices can shift between the time of writing and the time you read this. Before making deployment or compliance decisions based on this article, verify current figures on each provider's official source: Hugging Face model cards for licenses and benchmarks, provider websites for API pricing, and EUR-Lex for current GDPR and EU AI Act text. This article reflects publicly available information as of May 2026.
