PromptQuorum

What Is Q4_K_M Quantization?

Quick Answer

Q4_K_M means 4-bit quantization using k-quant (K) compression at medium (M) quality. It is the best default for most models: better quality than Q4_0, smaller than Q8_0.

  • Q = quantized, 4 = 4-bit, K = k-quant, M = medium
  • Better quality than Q4_0 for a small increase in file size
  • Use Q4_K_M as your default quantization

Updated: 2026-05

Quantization & VRAM | Intermediate

Key Takeaways

  • Q4_K_M = 4-bit quantization with k-quant compression at medium quality: better than Q4_0 for a small increase in file size
  • A 7B model at Q4_K_M fits in ~4.1 GB on disk and needs ~5.5 GB VRAM to run
  • Use Q4_K_M as your default: it delivers the best quality-per-gigabyte for most VRAM budgets

What Each Letter in Q4_K_M Means

Q4_K_M exists because the older 4-bit format, Q4_0, lost too much quality on critical weights. K-quant compression addresses this by allocating more bits to the weights that affect output most and fewer bits to weights with minimal impact. The result, as of May 2026: 5–8% better quality than Q4_0 for a small increase in file size.

The "K" is the key differentiator. K-quant compression applies non-uniform bit allocation: critical weights get more bits, less important ones get fewer. That is where the 5–8% quality gain over the older Q4_0 format comes from.
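To make the "more bits, less error" intuition concrete, here is a minimal sketch. It is not the k-quant algorithm used by any real runtime: `quantize_block` and `max_abs_error` are hypothetical helpers that apply plain uniform min-max quantization to a made-up block of eight weights, purely to show how rounding error depends on bit width.

```python
def quantize_block(weights, bits):
    """Round-trip one block through uniform min-max quantization.

    Illustrative only: real k-quant formats use a more elaborate scheme.
    """
    levels = (1 << bits) - 1              # number of steps between min and max
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]   # integer codes
    return [lo + c * scale for c in codes]               # dequantized values


def max_abs_error(weights, bits):
    """Largest absolute difference between original and restored weights."""
    restored = quantize_block(weights, bits)
    return max(abs(a - b) for a, b in zip(weights, restored))


# Made-up example block (assumption, not taken from any real model)
block = [-0.82, -0.31, -0.05, 0.0, 0.07, 0.12, 0.44, 0.91]

errors = {bits: max_abs_error(block, bits) for bits in (3, 4, 6)}
for bits, err in errors.items():
    print(f"{bits}-bit: max abs error = {err:.4f}")
```

On this block the worst-case error shrinks as the bit width grows from 3 to 4 to 6, which is why spending extra bits on the most sensitive weights can pay off.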

The "M" is the quality setting within k-quant. Q4_K_S (small) is slightly smaller with lower quality. Q4_K_M (medium) is the best balance. Q4_K_L (large) is marginally better but rarely worth the extra size.

K-quant works by clustering weights and assigning bits based on importance. Top-importance clusters get 6 bits per weight. Mid-tier clusters get 4 bits. Low-importance clusters get 3 bits. The "M" tier averages 4.5 bits per weight across the model, which explains why Q4_K_M sits between Q4_K_S and Q5_K_M in both size and quality. If the M tier is not enough, see Q4_K_M vs Q8_0.
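The arithmetic behind that average can be sketched in a few lines. The split between the 6-bit, 4-bit, and 3-bit tiers below is a hypothetical one, chosen only so that the weighted average lands on the 4.5 bits per weight quoted above; the article does not state the real proportions. The helper names are illustrative too.

```python
def average_bits(mix):
    """Weighted average bits per weight for a {bits: fraction} mapping."""
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(bits * fraction for bits, fraction in mix.items())


def estimated_size_gb(n_params, bits_per_weight):
    """Weight storage only, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Hypothetical tier split (assumption): 30% at 6 bits, 60% at 4, 10% at 3
mix = {6: 0.3, 4: 0.6, 3: 0.1}

avg = average_bits(mix)
size = estimated_size_gb(7e9, avg)
print(f"average: {avg:.2f} bits/weight, estimated size: {size:.2f} GB")
```

For exactly 7 billion parameters this works out to roughly 3.9 GB of weights. The ~4.1 GB figure quoted for a 7B model is in the same range; the remainder plausibly comes from file metadata and from "7B" models having somewhat more than 7 billion parameters, though that is an assumption rather than something the figures here establish.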

How Q4_K_M Compares to Other Quantizations

The table below shows the tradeoffs for a 7B model. Quality is relative to Q8_0, the 8-bit format used here as the reference (it is itself quantized, not full precision). Unless you have 12+ GB of VRAM, Q4_K_M gives the best quality-per-gigabyte.

For a direct comparison of Q4_K_M vs Q8_0, see the Q4_K_M vs Q8_0 decision guide. For the full quantization reference, see the quantization levels comparison.

Format   File Size (7B)   Quality vs Q8_0
Q4_0     3.8 GB           Baseline (~87%)
Q4_K_M   4.1 GB           ~92% (+5%)
Q5_K_M   5.0 GB           ~95% (+3%)
Q8_0     7.7 GB           100% (reference)
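As a rough illustration of how these numbers turn into a choice, the sketch below picks the highest-quality format that fits a given VRAM budget. The file sizes and quality scores are the ones in the table. The 1.4 GB runtime overhead is an assumption inferred from this article's own figures (4.1 GB on disk versus ~5.5 GB VRAM for Q4_K_M); real overhead depends on context length and runtime. `best_format` is a hypothetical helper, not part of any tool.

```python
# (format, file size in GB for a 7B model, quality % relative to Q8_0)
FORMATS = [
    ("Q4_0", 3.8, 87),
    ("Q4_K_M", 4.1, 92),
    ("Q5_K_M", 5.0, 95),
    ("Q8_0", 7.7, 100),
]

# Assumption: VRAM needed is roughly file size plus a fixed overhead
RUNTIME_OVERHEAD_GB = 1.4


def best_format(vram_gb, headroom_gb=0.0):
    """Return the highest-quality format that fits, or None if none fits.

    headroom_gb reserves extra VRAM for other workloads or longer contexts.
    """
    fitting = [
        (name, quality)
        for name, size_gb, quality in FORMATS
        if size_gb + RUNTIME_OVERHEAD_GB + headroom_gb <= vram_gb
    ]
    if not fitting:
        return None
    return max(fitting, key=lambda item: item[1])[0]


for budget in (6, 8, 12):
    print(f"{budget} GB VRAM -> {best_format(budget)}")
```

With no headroom reserved, an 8 GB card can hold Q5_K_M under these assumptions. Reserving a couple of gigabytes of headroom shifts the pick back to Q4_K_M, which matches its role as the safe default.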

Quick Answers About Quantization

Is Q4_K_M the same as Q4_0?
No. Q4_K_M uses k-quant compression which recovers 5–8% quality over Q4_0 at the same bit depth. Always prefer Q4_K_M over Q4_0. See the Q4_K_M vs Q8_0 guide for when to go higher.
Which quantization should I use for 8 GB VRAM?
Q4_K_M for 7B models (5.5 GB VRAM). If you want better quality and have headroom, Q5_K_M uses 6.5 GB and adds ~3% quality. Both fit comfortably in 8 GB.
What does the 'M' in Q4_K_M stand for?
Medium. It refers to the quality tier within k-quant compression. Q4_K_S is the small (lower quality) variant, Q4_K_M is medium (recommended), and Q4_K_L is large (marginal gain over M).
Which models on Ollama use Q4_K_M by default?
Most of them: Llama 3, Mistral, Qwen, Phi, and Gemma all default to Q4_K_M tags. To switch quantization, pick a model tag that names another level, such as one ending in q5_K_M or q8_0. Exact tag names vary by model, so check the model's tag list.