Quick Answer
Q4_K_M means 4-bit quantization using k-quant (K) compression at medium (M) quality. It is the best default for most models: better quality than Q4_0, smaller than Q8_0.
Updated: 2026-05
Key Takeaways
Q4_K_M exists because older 4-bit formats such as Q4_0 lost too much quality on critical weights. K-quant compression addresses this by allocating more bits to the weights that affect output most and fewer bits to weights with minimal impact. The result, as of May 2026: 5–8% better quality than Q4_0 at a similar file size.
The "K" is the key differentiator. K-quant compression applies non-uniform bit allocation: critical weights get more bits, less important ones get fewer. This recovers 5–8% quality compared to the older Q4_0 format at a similar file size.
The "M" is the quality setting within k-quant. Q4_K_S (small) is slightly smaller with lower quality. Q4_K_M (medium) is the best balance. Q4_K_L (large) is marginally better but rarely worth the extra size.
K-quant works by clustering weights and assigning bits based on importance. Top-importance clusters get 6 bits per weight. Mid-tier clusters get 4 bits. Low-importance clusters get 3 bits. The "M" tier averages 4.5 bits per weight across the model, which explains why Q4_K_M sits between Q4_K_S and Q5_K_M in both size and quality. If the M tier is not enough, see Q4_K_M vs Q8_0.
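The averaging described above can be checked with a little arithmetic. The sketch below is illustrative only: the bit widths (6, 4, 3) come from the paragraph above, but the share of weights in each tier is a hypothetical split chosen so that the average lands on 4.5 bits. Real models will differ, and the function names are made up for this example.

```python
# Each entry: (tier name, bits per weight, share of all weights).
# The bit widths come from the text; the shares are HYPOTHETICAL,
# picked only to illustrate how a 4.5-bit average can arise.
TIERS = [
    ("top", 6, 0.30),
    ("mid", 4, 0.60),
    ("low", 3, 0.10),
]


def average_bits_per_weight(tiers):
    """Weighted average of bits per weight across all tiers."""
    total_share = sum(share for _, _, share in tiers)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("tier shares must sum to 1")
    return sum(bits * share for _, bits, share in tiers)


def estimated_size_gb(n_params, bits_per_weight):
    """Raw weight storage in GB (1 GB = 1e9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9


avg = average_bits_per_weight(TIERS)
print(f"average bits per weight: {avg:.2f}")
print(f"estimated 7B size: {estimated_size_gb(7e9, avg):.2f} GB")
```

The estimate covers raw weight storage only. A real file also carries scale factors and metadata, so it will usually come out somewhat larger than this figure.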
The table below shows the tradeoffs for a 7B model. Quality is relative to the 8-bit Q8_0 reference, which is itself quantized rather than full precision. Unless you have 12+ GB VRAM, Q4_K_M gives the best quality-per-gigabyte.
For a direct comparison of Q4_K_M vs Q8_0, see the Q4_K_M vs Q8_0 decision guide. For the full quantization reference, see the quantization levels comparison.
| Format | File Size (7B) | Quality vs Q8_0 |
|---|---|---|
| Q4_0 | 3.8 GB | Baseline (~87%) |
| Q4_K_M | 4.1 GB | ~92% (+5%) |
| Q5_K_M | 5.0 GB | ~95% (+3%) |
| Q8_0 | 7.7 GB | 100% (reference) |
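One way to read the table is to ask how much quality each step up buys per extra gigabyte. The sketch below uses only the sizes and approximate quality figures from the table. The function name and the "quality points per GB" metric are assumptions made for illustration, not a standard measure.

```python
# (format, file size in GB for a 7B model, approximate quality vs Q8_0 in %)
# Values copied from the table above.
FORMATS = [
    ("Q4_0", 3.8, 87),
    ("Q4_K_M", 4.1, 92),
    ("Q5_K_M", 5.0, 95),
    ("Q8_0", 7.7, 100),
]


def marginal_gains(formats):
    """Quality points gained per extra GB for each step up the table."""
    steps = []
    for (name_a, size_a, q_a), (name_b, size_b, q_b) in zip(formats, formats[1:]):
        gain_per_gb = (q_b - q_a) / (size_b - size_a)
        steps.append((f"{name_a} -> {name_b}", gain_per_gb))
    return steps


for step, gain in marginal_gains(FORMATS):
    print(f"{step}: {gain:.1f} quality points per extra GB")
```

On these figures, the step from Q4_0 to Q4_K_M returns far more quality per added gigabyte than either of the later steps, which is consistent with treating Q4_K_M as the default.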
Use :q5_K_M or :q8_0 in the model tag to switch quantization.