Key Takeaways
- Quantization converts 16-bit model weights to 4-bit or 8-bit, reducing RAM by 50-75%.
- Q4_K_M is the standard recommended level -- best balance of quality and RAM for consumer hardware.
- A 7B model at FP16 = ~14 GB RAM. At Q4_K_M = ~4.5 GB. At Q8_0 = ~7.7 GB.
- Quality loss at Q4_K_M is 1-3% on MMLU benchmarks compared to FP16 -- imperceptible in most practical tasks.
- GGUF is the file format that stores quantized models for llama.cpp, Ollama, and LM Studio.
What Is LLM Quantization and Why Does It Matter?
Quantization converts 16-bit model weights (FP16) to 4-bit or 8-bit integers, reducing RAM by 50-75% with only 1-3% quality loss at Q4_K_M. A large language model stores its learned knowledge as billions of numerical weights. By default, these are stored as 16-bit floating-point numbers (FP16) -- two bytes per weight. A 7B model has 7 billion weights, so the FP16 file size is approximately 14 GB.
Quantization replaces these 16-bit floats with lower-precision integers. At 4-bit quantization, each weight uses 0.5 bytes instead of 2 -- cutting memory to ~3.5 GB for the weights alone. Q4_K_M also stores per-block scale factors and keeps some tensors at higher precision, so together with file metadata a quantized 7B model at Q4_K_M is approximately 4.5 GB.
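The arithmetic above can be checked with a few lines of Python. This is a rough sketch that counts only the raw weights; the function name `weight_size_gb` is hypothetical, and 1 GB is taken as 10^9 bytes. Real GGUF files come out somewhat larger than this estimate.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


fp16_gb = weight_size_gb(7e9, 16)  # 16-bit floats: 2 bytes per weight
int4_gb = weight_size_gb(7e9, 4)   # 4-bit integers: 0.5 bytes per weight

print(f"FP16:  {fp16_gb:.1f} GB")              # 14.0 GB
print(f"4-bit: {int4_gb:.1f} GB")              # 3.5 GB
print(f"saved: {1 - int4_gb / fp16_gb:.0%}")   # 75%
```

The same function works for other sizes, for example `weight_size_gb(70e9, 4)` for a 70B model at 4 bits.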
This matters for local inference because consumer hardware has limited RAM. Without quantization, a 7B model requires 16 GB of RAM to run. With Q4_K_M quantization, the same model runs on 6 GB of RAM, making it accessible on most modern laptops.
What is Q4_K_M Quantization?
Q4_K_M is a 4-bit GGUF quantization format used in llama.cpp and Ollama. The "K" means it uses K-quants (mixed precision), and "M" = medium -- a balance between model size, speed, and quality loss. Q4_K_M stores most weights at 4-bit but uses 6-bit for the most sensitive layers, giving it a better quality-to-size ratio than pure 4-bit Q4_0.
- Q4_K_M uses ~4.5 GB RAM for a 7B model -- about 70% less than FP16 -- with only 1-3% quality loss
- K-quants apply different precision to different weight groups based on sensitivity (important weights get more bits)
- The "M" variant is the standard recommended version (lighter "S" and heavier "L" variants also exist)
- Q4_K_M is the default choice for consumer hardware with 6-16 GB VRAM
- Works with Ollama (e.g., `ollama run llama3.1:8b-instruct-q4_K_M`), LM Studio, and llama.cpp
How Do Q4_K_M, Q5_K_M, Q8_0, and Other Levels Differ?
Q4_K_M at 4-bit is the standard recommendation -- approximately 4.5 GB RAM for a 7B model with only 1-3% quality loss vs FP16. Quantization names follow a pattern: Q{bits}_{variant}. The bit count is the weight precision; the variant affects how the quantization is applied:
| Level | Bits | RAM (7B) | Quality Loss | Use When |
|---|---|---|---|---|
| Q2_K | 2 | ~2.7 GB | High | RAM < 4 GB, accept quality degradation |
| Q3_K_S | 3 | ~3.3 GB | Moderate | RAM 4-5 GB |
| Q4_K_M | 4 | ~4.5 GB | Low (1-3%) | Default for most users |
| Q5_K_M | 5 | ~5.7 GB | Minimal (<1%) | 16 GB RAM, want better quality |
| Q6_K | 6 | ~6.6 GB | Near-lossless | 16 GB RAM, coding/math tasks |
| Q8_0 | 8 | ~7.7 GB | Negligible | 16+ GB RAM, maximum quality |
What Is GGUF Format and How Does It Relate to Quantization?
GGUF (GPT-Generated Unified Format) is the single-file standard for quantized LLM weights, containing model weights, metadata, and tokenizer -- used by Ollama, LM Studio, and llama.cpp. It was created by the llama.cpp project and replaces the older GGML format.
A GGUF file contains: the quantized model weights, all model metadata (architecture, tokenizer, context length), and a format version number. This self-contained design means a single `.gguf` file is everything needed to run the model -- no separate tokenizer files, no configuration JSON.
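To make the "format version number" concrete, here is a minimal sketch that reads the fixed-size header at the start of a GGUF file. It assumes the little-endian layout of GGUF version 2 or later (4-byte magic, 32-bit version, 64-bit tensor count, 64-bit metadata key/value count); the function name `read_gguf_header` is hypothetical, and the demo uses a synthetic header rather than a real model file.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (KV count)


def read_gguf_header(data: bytes) -> dict:
    """Parse the first 24 bytes of a GGUF file (assumes version >= 2, little-endian)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes")
    magic, version, n_tensors, n_kv = struct.unpack("<4sIQQ", data[:HEADER_SIZE])
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    return {
        "version": version,
        "tensor_count": n_tensors,
        "metadata_kv_count": n_kv,
    }


# Synthetic header for demonstration: version 3, 291 tensors, 24 metadata entries.
fake_header = struct.pack("<4sIQQ", GGUF_MAGIC, 3, 291, 24)
print(read_gguf_header(fake_header))
```

With a real file, the same function can be fed `open(path, "rb").read(24)`. The metadata entries and tensor descriptions follow the header and are not parsed here.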
As of April 2026, GGUF is the standard format for Ollama, LM Studio, Jan AI, and GPT4All. When you run `ollama pull llama3.1:8b`, Ollama downloads a GGUF file internally. When LM Studio shows model file sizes, those are GGUF file sizes.
The quantization level is part of the filename: `Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf` is a Q4_K_M quantized GGUF of Llama 3.1 8B.
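Because the level is embedded in the filename, it can be recovered with a small regular expression. The sketch below is illustrative only: the helper name `quant_from_filename` is hypothetical, and the pattern covers just the label shapes mentioned in this article (`Q{bits}_K`, `Q{bits}_K_S/M/L`, `Q{bits}_0`, `Q{bits}_1`), not every naming scheme found on Hugging Face.

```python
import re

# Matches labels such as Q4_K_M, Q3_K_S, Q6_K, Q8_0; the lookarounds stop it
# from matching inside longer tokens (for example the "Q4" inside "IQ4_XS").
QUANT_RE = re.compile(
    r"(?<![A-Za-z0-9])(Q(\d)_(?:K(?:_[SML])?|[01]))(?![A-Za-z0-9])",
    re.IGNORECASE,
)


def quant_from_filename(filename: str):
    """Return (label, bits) parsed from a GGUF filename, or None if no label is found."""
    match = QUANT_RE.search(filename)
    if match is None:
        return None
    return match.group(1).upper(), int(match.group(2))


print(quant_from_filename("Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"))
```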
How Much RAM Does Quantization Save for Different Model Sizes?
| Model Size | FP16 | Q8_0 | Q4_K_M | Q3_K_S |
|---|---|---|---|---|
| 3B | ~6 GB | ~3.8 GB | ~2 GB | ~1.6 GB |
| 7B | ~14 GB | ~7.7 GB | ~4.5 GB | ~3.3 GB |
| 13B | ~26 GB | ~14 GB | ~8.5 GB | ~6 GB |
| 34B | ~68 GB | ~36 GB | ~22 GB | ~16 GB |
| 70B | ~140 GB | ~70 GB | ~40 GB | ~30 GB |
How Much Quality Do You Actually Lose with Quantization?
Q4_K_M loses 1-3% on MMLU benchmarks vs FP16 -- imperceptible in most practical tasks. Q3_K_S loses 5-10% and is noticeable on math and reasoning. Quality loss from quantization is measured by comparing benchmark scores between full-precision and quantized versions.
Quantization reduces memory usage but can degrade output quality. Well-engineered prompts compensate: techniques like few-shot examples and explicit output constraints help quantized models maintain accuracy. See prompt engineering techniques for methods that work at any quantization level.
As of April 2026, the established findings are:
- Q4_K_M vs FP16: 1-3% degradation on MMLU. On a 7B model scoring 73% at FP16, Q4_K_M scores 71-72%. In practical tasks, this difference is imperceptible.
- Q3_K_S vs FP16: 5-10% degradation. Noticeable on complex reasoning and math tasks. A model that correctly solves a math problem at FP16 may fail at Q3_K_S.
- Q2_K vs FP16: 15-25% degradation. Significant quality loss across all task types. Only use when RAM constraint is absolute.
- Q8_0 vs FP16: under 0.5% degradation -- essentially identical for all practical purposes.
- The K_M variants (K-Quant Medium) use a mixed-precision approach that preserves quality better than older Q4_0 quantization at the same bit count. Always prefer Q4_K_M over Q4_0 when both are available.
Which Quantization Should You Use? (Quick Decision Tree)
Choose based on your available VRAM, not on the model size alone. The table below shows which quantization to select for different hardware constraints.
- For 6 GB RAM (most common laptop/desktop): Use Q4_K_M. A 7B model quantized to Q4_K_M is ~4.5 GB, leaving 1.5 GB for the OS and browser.
- For coding or math tasks: Use Q5_K_M or higher even if you have budget for Q4_K_M. Quantization effects (1-3% loss) are most visible on precise numerical reasoning. For an end-to-end air-gapped coding setup that pairs Q5_K_M Qwen3-Coder with no-internet operation, see Local Coding LLM Without Internet.
- Quantization + Temperature trade-off: A Q4_K_M model at temperature 0.3 produces more deterministic output than a full-precision (FP16) model at temperature 1.0. For independent tuning, see temperature and top-p: control AI creativity.
| Your VRAM | Best Quantization | Model Size | Quality |
|---|---|---|---|
| 4-6 GB | Q3_K_S or Q4_K_M | 3B, 7B (Q4); 7B (Q3) | 5-10% loss (Q3); 1-3% (Q4) |
| 6-8 GB | Q4_K_M (recommended) | 7B native | 1-3% loss (imperceptible) |
| 12-16 GB | Q5_K_M | 7B, 13B native | <1% loss (minimal) |
| 24 GB (RTX 4090) | Q5_K_M or Q6_K | 13B, 32B native; Q4 + offload for 70B | Negligible <0.5% |
| 32 GB (RTX 5090) | Q5_K_M, Q6_K, or Q8_0 | 70B at Q4 (35 GB), Q5 (43 GB) | 0-2% loss |
| 48+ GB (2× RTX 4090) | Q5_K_M or Q8_0 | 70B native with layer splitting | Negligible <0.5% |
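The table can also be encoded as a simple lookup. The sketch below uses the lower bound of each VRAM row as a threshold; the function name `recommend_quantization` is hypothetical, and the handling of values the table does not list (8-12 GB and 16-24 GB fall back to the tier below, under 4 GB returns `None`) is an assumption made for illustration.

```python
# (minimum VRAM in GB, recommendation), ordered from largest to smallest tier.
TIERS = [
    (48, "Q5_K_M or Q8_0"),
    (32, "Q5_K_M, Q6_K, or Q8_0"),
    (24, "Q5_K_M or Q6_K"),
    (12, "Q5_K_M"),
    (6, "Q4_K_M"),
    (4, "Q3_K_S or Q4_K_M"),
]


def recommend_quantization(vram_gb: float):
    """Return the table's recommendation for a VRAM budget, or None below 4 GB."""
    for min_vram_gb, recommendation in TIERS:
        if vram_gb >= min_vram_gb:
            return recommendation
    return None


for vram in (5, 8, 16, 24, 48):
    print(vram, "GB ->", recommend_quantization(vram))
```

A lookup like this only captures the quantization level; which model size fits still depends on the file size of the specific GGUF.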
LM Studio: How to Select Quantization in the UI
LM Studio (desktop app) shows available quantization variants for each model download. When searching for a model, you'll see multiple GGUF options: Q2_K, Q3_K_S, Q4_K_M, Q5_K_M, Q6_K, Q8_0.
Step 1: Open LM Studio and navigate to the "Local Models" tab. Search for a model (e.g., "Llama 3.1 8B").
Step 2: Each model shows available quantizations. Look at the file size to estimate VRAM usage. Q4_K_M for a 7B model is usually listed as ~4.5 GB.
Step 3: Click the download icon next to your chosen quantization.
Recommended defaults for LM Studio:
- If your GPU has 6-8 GB VRAM (RTX 4060, RTX 3060 Ti, RTX 4060 Ti): Download the Q4_K_M variant (smallest file with acceptable quality).
- If your GPU has 12-16 GB VRAM (RTX 4070, RTX 4080): Download Q5_K_M or Q6_K (better quality, still well within VRAM).
- If your GPU has 24+ GB VRAM (RTX 4090, RTX 5090): Download Q8_0 or FP16 (maximum quality, minimal speed penalty).
LM Studio's "GPU offload" feature: Check the "Use GPU" toggle in the chat interface. LM Studio will automatically move as many model layers to GPU as VRAM allows, offloading the rest to CPU RAM. If your system RAM is sufficient, this allows running models larger than your GPU VRAM (e.g., Llama 3.3 70B Q4_K_M on RTX 4090 with 64+ GB system RAM).
Offloading: CPU RAM as Spillover
When VRAM is full, models can offload (move) layers to system RAM. Offloading trades speed for capacity.
Scenario: Running a 70B Q4 model on an RTX 4090 (24 GB). The model needs 35 GB, so at least 11 GB (about a third of the weights) must sit in system RAM. With offloading, it runs at ~5-10 tokens/sec.
Offloading is a last resort -- it makes inference impractical. Use only for offline batch processing or experimentation.
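# Ollama: layers that do not fit in VRAM are placed in system RAM automatically
ollama run llama3.3:70b
# Optional, inside the interactive session: set how many layers go to the GPU
# (40 is an example value; 0 forces CPU-only inference)
# >>> /set parameter num_gpu 40
# vLLM: enable CPU offload (partial)
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --gpu-memory-utilization 0.7 \
  --cpu-offload-gb 10  # Offload 10 GB to system RAM
Layer Splitting: Distribute Across Multiple GPUs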
Modern inference engines (vLLM, llama.cpp) can split a model across multiple GPUs automatically. Learn more about Multi-GPU Local LLMs for advanced setups.
Example: 70B model with 2× RTX 4090:
- Without splitting: Impossible (needs 40+ GB VRAM in one GPU).
- With splitting: Half the model weights on each GPU. Inference speed: ~100 tokens/sec (communication overhead is minimal).
Layer splitting is practical for production deployments and is transparent to the user.
# vLLM: automatic tensor parallelism
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 2  # Split across 2 GPUs
# Ollama (llama.cpp backend): multi-GPU support
ollama run llama3.3:70b  # Auto-detects and splits across GPUs
KV Cache Quantization: Reducing Context Memory Overhead
KV Cache quantization reduces the memory required to store attention key-value pairs during inference, particularly important when processing long contexts (32K+ tokens). While model weight quantization (Q4_K_M) is most common, KV cache quantization targets a different memory bottleneck.
During inference, the model maintains running key-value (KV) pairs for each token in the context. For a 7B model processing a 32K-token context, KV cache alone can consume 8-16 GB of VRAM depending on the precision. Standard KV cache uses FP16 (2 bytes per value); quantizing the KV cache to FP8 or Q8 reduces this by 50%.
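The 8-16 GB range can be reproduced from the usual KV cache size formula: two tensors (key and value) per layer, each holding `n_kv_heads * head_dim` values per token. The sketch below assumes a 7B-class shape without grouped-query attention (32 layers, 32 KV heads, head dimension 128, similar to Llama 2 7B); those shape values and the function name `kv_cache_gib` are assumptions for illustration, and the 8-bit case treats each value as exactly 1 byte, ignoring block scale overhead.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value):
    """KV cache size in GiB: one key and one value vector per layer per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value
    return total_bytes / 2**30


# Assumed 7B-class shape without grouped-query attention, 32K-token context.
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128, n_tokens=32 * 1024)

fp16_cache = kv_cache_gib(**shape, bytes_per_value=2)  # FP16: 2 bytes per value
int8_cache = kv_cache_gib(**shape, bytes_per_value=1)  # 8-bit: ~1 byte per value

print(f"FP16 cache:  {fp16_cache:.1f} GiB")  # 16.0 GiB
print(f"8-bit cache: {int8_cache:.1f} GiB")  # 8.0 GiB
```

Models that use grouped-query attention have fewer KV heads and a proportionally smaller cache. Llama 3.1 8B, for example, uses 8 KV heads, which cuts the FP16 figure above to a quarter.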
How to enable KV Cache quantization:
- Ollama: Set `OLLAMA_FLASH_ATTENTION=1` and `OLLAMA_KV_CACHE_TYPE=q8_0` in the server environment; the cache stays at `f16` unless you change it.
- LM Studio: Check "KV cache quantization" toggle in Settings (if available on your version).
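- llama.cpp: Pass `--cache-type-k q8_0` and `--cache-type-v q8_0` (short forms `-ctk` and `-ctv`) when starting the server; quantizing the V cache also requires flash attention to be enabled.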
Trade-offs: KV cache quantization has minimal quality impact (<1% degradation even with aggressive quantization) because attention patterns are more robust to lower precision than model weights. Recommended for models processing 16K+ contexts on constrained hardware.
Hybrid Approach: Combining Techniques
Best results come from combining all three techniques. See VRAM Requirements Guide for specific hardware planning.
Scenario 1: 70B on single RTX 4090 (24 GB)
- Quantize to Q4 (140 GB → 35 GB)
- Use offloading for the remaining 11 GB (to system RAM)
- Result: ~8-10 tokens/sec (slow but works)
Scenario 2: 70B on 2× RTX 4090
- Quantize to Q5 (43.75 GB)
- Use layer splitting across 2 GPUs (22 GB each)
- Result: ~100 tokens/sec (practical)
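The two scenarios can be sanity-checked with a weights-only placement estimate. This is a sketch under simplifying assumptions: it ignores the KV cache and runtime overhead, treats 1 GB as 10^9 bytes, and assumes weights are spread evenly across GPUs. The function name `plan_placement` is hypothetical.

```python
def plan_placement(n_params, bits_per_weight, vram_per_gpu_gb, n_gpus=1):
    """Split quantized weights between GPU VRAM and system RAM (weights only)."""
    model_gb = n_params * bits_per_weight / 8 / 1e9
    total_vram_gb = vram_per_gpu_gb * n_gpus
    on_gpu_gb = min(model_gb, total_vram_gb)
    return {
        "model_gb": model_gb,
        "per_gpu_gb": on_gpu_gb / n_gpus,
        "offloaded_gb": model_gb - on_gpu_gb,
    }


# Scenario 1: 70B at 4 bits on one 24 GB GPU.
scenario_1 = plan_placement(70e9, 4, 24)
# Scenario 2: 70B at 5 bits across two 24 GB GPUs.
scenario_2 = plan_placement(70e9, 5, 24, n_gpus=2)

print(scenario_1)
print(scenario_2)
```

For Scenario 1 this gives 35 GB of weights with 11 GB left over for system RAM. For Scenario 2 it gives 43.75 GB, or about 21.9 GB per GPU with nothing offloaded. In practice the KV cache and runtime buffers also need VRAM, so the real offloaded amount is somewhat higher.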
What Are the Performance Trade-offs?
Each technique trades VRAM reduction for speed penalties. Quantization has minimal impact; offloading causes a 5-10× slowdown; layer splitting adds ~5-10% overhead.
| Technique | VRAM Saved | Speed Impact | Quality Impact |
|---|---|---|---|
| Quantization (Q4) | 50% | None (±5%) | Minor |
| Offloading (CPU RAM) | 60-80% | 5-10× slower | None |
| Layer splitting (2 GPUs) | N/A (enables larger models) | 5-10% slower | None |
| Quantization + Offloading | 75-90% | 3-5× slower | Minor |
Mac Studio M2 Ultra: Native 70B Without Offloading
Mac Studio M2 Ultra with 192 GB unified memory runs Llama 3.3 70B at Q4 natively -- no offloading, no layer splitting required.
Unified memory bandwidth: Mac Studio M2 Ultra accesses both CPU and GPU memory at ~800 GB/s. DDR5 system RAM offloading is capped at ~90 GB/s. This 9× advantage eliminates the speed penalty that makes offloading impractical.
| Setup | Model | Speed | Complexity |
|---|---|---|---|
| 1× RTX 4090 + offloading | Llama 3.3 70B Q4 | 5-10 tok/sec | Medium |
| 2× RTX 4090 layer split | Llama 3.3 70B Q5 | ~100 tok/sec | High |
| 1× RTX 5090 (32 GB) | Llama 3.3 70B Q4 | 10-12 tok/sec | Low |
| Mac Studio M2 Ultra | Llama 3.3 70B Q4 | 35 tok/sec | Low (plug & play) |
LLM Quantization: Regional Context
- EU (GDPR, Article 44) -- Cross-border AI data transfers require adequacy decisions or Standard Contractual Clauses. Q4_K_M quantization enables 7B models to run on 8 GB edge devices, eliminating third-party cloud API calls entirely. The German BfDI and French CNIL both recommend local inference for high-risk AI processing under GDPR Article 22. Quantized Mistral and Llama models are the dominant choices in EU enterprise deployments for this reason.
- Japan (METI AI Governance Guidelines 2024) -- Japan's Ministry of Economy, Trade and Industry requires AI governance documentation for enterprise deployments. Quantized models on domestic infrastructure satisfy METI's "controllability" requirements -- the model weights stay on-premises. Q4_K_M quantization makes 13B-32B models feasible on 16-32 GB corporate servers without GPU clusters. Qwen2.5 and Llama 3 are the most-deployed families in Japanese enterprise settings.
- China (CAC Generative AI Regulations 2023) -- China's Cyberspace Administration requires security assessments for publicly deployed AI and data localization for user data. Quantized Chinese-native models (Qwen2.5, Baichuan2, Yi) run entirely on domestic hardware, satisfying CAC localization requirements. Q4_K_M and Q5_K_M quantization reduce hardware costs by 60-70% versus FP16, making on-premises CAC compliance economically viable for mid-sized enterprises.
What Are the Common Mistakes with LLM Quantization?
- Downloading Q4_0 instead of Q4_K_M -- Q4_0 is an older quantization method without K-Quant improvements. Q4_K_M is 5-8% better quality at the same RAM footprint. When both are available, always choose Q4_K_M.
- Assuming a higher Q number means worse quality -- Higher Q number = more bits = better quality. Q8_0 is better than Q4_K_M. Q5_K_M is better than Q4_K_M. Model size still matters more than bit count: a Q4_K_M 70B model will outperform a Q8_0 7B model on most tasks.
- Not checking RAM headroom before loading a model -- The model size is not the only RAM consumer. OS, browser, and other applications use RAM too. On an 8 GB machine, a 4.5 GB Q4_K_M 7B model leaves only 3.5 GB for everything else. Rule: model file size + 2 GB OS overhead + 1 GB headroom = minimum required RAM.
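The headroom rule in the last item is easy to turn into a pre-download check. The sketch below applies that rule as stated (model file size + 2 GB OS overhead + 1 GB headroom); the function names are hypothetical, and the file sizes in the example are the 7B figures from the tables earlier in this article.

```python
OS_OVERHEAD_GB = 2.0  # OS allowance from the rule above
HEADROOM_GB = 1.0     # safety margin from the rule above


def min_ram_gb(model_file_gb: float) -> float:
    """Minimum RAM suggested by the rule: file size + OS overhead + headroom."""
    return model_file_gb + OS_OVERHEAD_GB + HEADROOM_GB


def fits_in_ram(model_file_gb: float, installed_ram_gb: float) -> bool:
    """True if the machine meets the rule's minimum for this model file."""
    return installed_ram_gb >= min_ram_gb(model_file_gb)


print(min_ram_gb(4.5))        # Q4_K_M 7B: 7.5 GB minimum
print(fits_in_ram(4.5, 8))    # True on an 8 GB machine
print(fits_in_ram(7.7, 8))    # Q8_0 7B: False on an 8 GB machine
```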
Common Questions About LLM Quantization
Does Ollama automatically use the best quantization?
Yes -- when you run `ollama pull llama3.1:8b`, Ollama downloads the Q4_K_M variant by default. To pull a specific quantization, append the tag: `ollama pull llama3.1:8b-instruct-q5_K_M`. Available quantization tags for each model are listed on the model's page at ollama.com/library.
Can I quantize a model myself instead of downloading a pre-quantized version?
Yes -- llama.cpp includes a `llama-quantize` binary (named `quantize` in older builds) that converts GGUF files to any supported quantization level. The process takes 5-30 minutes depending on model size. Most users should download pre-quantized GGUF files from Hugging Face rather than quantizing themselves, as the results are equivalent.
Does quantization affect the model's context window?
No -- quantization only affects model weight precision, not the context length. A Llama 3.1 8B model supports 128K tokens whether quantized to Q4_K_M or run at FP16. However, processing longer contexts requires more RAM regardless of quantization -- processing a 64K token context with a Q4_K_M 7B model may require 10+ GB RAM.
What is the difference between GGUF and GPTQ quantization?
GGUF (llama.cpp format) and GPTQ are two different quantization approaches. GGUF uses K-Quants and runs on CPU and GPU. GPTQ is GPU-only and requires PyTorch. For local inference with Ollama, LM Studio, or Jan AI, GGUF is the correct format. GPTQ is used with GPU-focused inference frameworks like AutoGPTQ and vLLM.
Is there a quality difference between Q4_K_M models from different providers on Hugging Face?
The quantization algorithm is standardized in llama.cpp, so Q4_K_M quantizations of the same base model should be nearly identical regardless of who created the GGUF file. However, some providers apply additional adjustments (imatrix quantization) that improve quality. Files described as "imat" or "importance matrix" quantized are generally higher quality at the same bit count.
What is the difference between Q4_K_M and Q4_0?
Q4_K_M and Q4_0 are both 4-bit quantization, but they use different algorithms. Q4_0 is the original uniform 4-bit format from early llama.cpp. Q4_K_M is a K-Quant introduced in 2023 -- it groups weights into blocks and applies mixed precision within each block, recovering 5-8% quality at the same RAM footprint. When you see both on Hugging Face, always choose Q4_K_M. Q4_0 only exists for legacy compatibility.
What is imatrix quantization?
Imatrix (importance matrix) quantization uses calibration data to assign different precision levels to different weights based on their importance to model output. Weights that most affect predictions are quantized with more bits; less important weights use fewer bits. Result: better quality at the same bit count compared to uniform quantization. Qwen2.5 imatrix quantizations are 2-4% better than standard Q4_K_M.
What's the difference between Q4_K_M and Q4_K_S?
Both are 4-bit quantization, but K_M (Medium) and K_S (Small) differ in memory allocation per quantization block. Q4_K_M uses more metadata for better quality reconstruction -- typically 4.5-5 GB for a 7B model. Q4_K_S is more aggressive -- saves 300-400 MB compared to K_M but with 3-5% quality loss. Use Q4_K_M unless you're on extremely constrained hardware (< 4 GB RAM).
Can I switch between quantization levels without redownloading the model?
No -- switching quantization levels requires downloading a different GGUF file or re-quantizing the base model yourself. Once a model is quantized to Q4_K_M, you cannot convert it back to Q5_K_M without the original FP16 model. Most users download pre-quantized GGUF files from Hugging Face for their desired quantization level.
How does quantization affect inference speed?
Quantization typically increases inference speed by 10-40% because loading and processing 4-bit weights is faster than 16-bit floats. A Q4_K_M 7B model runs at ~8-12 tok/s on a consumer CPU; the same model at FP16 runs at ~1-2 tok/s. GPU performance gain from quantization is smaller (5-15% faster) because GPUs are already optimized for float arithmetic.
What quantization level does Ollama use by default?
Ollama defaults to Q4_K_M for most models in its library. When you run `ollama pull llama3.1:8b`, you're downloading the Q4_K_M variant. This default balances quality and RAM requirements well for most users. To pull a different quantization, use the full tag: `ollama pull llama3.1:8b-instruct-q5_K_M` or `ollama pull llama3.1:8b-instruct-q8_0`.
Can I run Llama 3.3 70B on a single RTX 4090?
Yes, but slowly. Quantize to Q4 (35 GB), offload 11 GB to system RAM. Expect 5-10 tok/sec -- too slow for real-time chat, fine for batch processing. For practical 70B inference: 2× RTX 4090 with layer splitting (~100 tok/sec) or Mac Studio M2 Ultra (35 tok/sec native).
What is the difference between quantization and offloading?
Quantization reduces model weight precision permanently (FP16 → Q4), shrinking the model file. Offloading moves model layers from VRAM to system RAM at runtime. Quantization has minimal speed impact (±5%) and a minor quality cost; offloading causes 5-10× speed degradation. Use quantization first, offloading as a last resort.
Does Mac Studio M2 Ultra need quantization for 70B models?
Only mild quantization. 192 GB unified memory holds Llama 3.3 70B at Q4 (35 GB) natively -- no offloading or layer splitting. At Q5, 70B still fits (44 GB). FP16 70B (140 GB) also fits but runs slower. Q4 is the sweet spot for Mac Studio 70B workflows.
Which technique combination is best for my hardware?
Single RTX 4090 (24 GB): Q4 + offloading for 70B (slow). Q5 native for 32B (fast). 2× RTX 4090 (48 GB): Q5 + layer splitting for 70B (100 tok/sec). RTX 5090 (32 GB): Q4 native for 70B (10-12 tok/sec). Mac Studio M2 Ultra (192 GB): Q4 native for 70B (35 tok/sec).
Sources
- llama.cpp Quantization Documentation
- K-Quants Technical Discussion -- original K-Quant PR
- GGUF Format Specification
- Open LLM Leaderboard -- quantization benchmarks
Update Log
- 2026-05-17: Updated title to reflect decision-focused intent; content unchanged.