Quick Answer
Use Q4_K_M if you have 8 GB VRAM or less. Use Q8_0 if you have 12+ GB. Q4_K_M delivers about 93% of Q8_0 quality (~92% versus ~99% of full precision) at roughly half the file size.
Updated: 2026-05
Key Takeaways
As of May 2026, Q8_0 is ~99% of full-precision quality. Q4_K_M is ~92%. The 7-point gap is invisible in chat, coding, and summarization, three tasks that cover 95% of local LLM use. Q8_0 only pulls ahead on long-form factual recall, multi-step math, and code requiring exact syntax over 500+ lines.
Q4_K_M is the right default because the extra quality from Q8_0 only shows up in edge cases: long-form generation with exact factual recall, or mathematical reasoning that requires higher precision. For everything else, Q4_K_M matches Q8_0 in practice.
If you are already using Q4_K_M and your results feel wrong, the issue is almost never the quantization; it is the model size or prompt structure.
The table below compares Q4_K_M and Q8_0 for a 7B model. Both formats work with Ollama, LM Studio, and llama.cpp without any special configuration.
For context on what Q4_K_M means and how k-quant compression works, see the Q4_K_M explained guide. For the full quantization reference, see quantization levels compared.
Three tasks reveal Q4_K_M's quality gap: long-document recall (50+ pages), multi-step math with intermediate state, and code generation over 300+ lines. For these, Q8_0's extra precision prevents the small drift errors that compound across long outputs. For everything else (chat, code under 200 lines, Q&A, summarization) the gap is invisible. For a refresher before deciding, see what Q4_K_M means.
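The rules of thumb above can be written down as a small decision helper. This is only a sketch: the function name `choose_quant` and the task labels are hypothetical, the 8 GB and 12 GB thresholds come from the Quick Answer, and the handling of the 8 to 12 GB band is an assumption, since the article does not give a rule for it.

```python
# Sketch of the article's rule of thumb for a 7B-class model.
# The task labels and the 8-12 GB handling are assumptions for illustration.

# Tasks where the article says Q8_0's extra precision matters.
PRECISION_SENSITIVE = {
    "long_document_recall",
    "multi_step_math",
    "long_code_generation",
}


def choose_quant(vram_gb: float, task: str = "chat") -> str:
    """Return "Q8_0" or "Q4_K_M" for the given VRAM budget and task."""
    if vram_gb >= 12:
        return "Q8_0"      # article: use Q8_0 with 12+ GB
    if vram_gb <= 8:
        return "Q4_K_M"    # article: use Q4_K_M with 8 GB or less
    # Assumption: between 8 and 12 GB, pick Q8_0 only when the task is
    # precision-sensitive and there is at least 9 GB (the upper end of the
    # 7B Q8_0 VRAM range in the comparison table).
    if vram_gb >= 9 and task in PRECISION_SENSITIVE:
        return "Q8_0"
    return "Q4_K_M"        # article: Q4_K_M is the default


if __name__ == "__main__":
    for vram, task in [(6, "chat"), (10, "multi_step_math"), (10, "chat"), (16, "chat")]:
        print(f"{vram} GB, {task}: {choose_quant(vram, task)}")
```

Treat the middle band as a starting point and adjust it for your model size and context length.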
| Metric | Q4_K_M | Q8_0 |
|---|---|---|
| File size (7B model) | ~4.1 GB | ~7.7 GB |
| VRAM needed (7B) | 5–6 GB | 8–9 GB |
| Quality vs full precision | ~92% | ~99% |
| Best for | 6–8 GB VRAM | 12+ GB VRAM |
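To see where "roughly half the file size" comes from, the file sizes in the table can be turned into a size ratio and an implied number of bits stored per weight. This is a back-of-the-envelope sketch: it assumes exactly 7 billion weights and decimal gigabytes, and it ignores file metadata, so the results are approximations and not format specifications.

```python
# Back-of-the-envelope arithmetic on the table's file sizes.
PARAMS = 7e9          # assumption: exactly 7 billion weights
BYTES_PER_GB = 1e9    # assumption: decimal gigabytes

# File sizes for a 7B model, taken from the comparison table.
FILE_SIZE_GB = {"Q4_K_M": 4.1, "Q8_0": 7.7}


def implied_bits_per_weight(size_gb: float, params: float = PARAMS) -> float:
    """Average bits stored per weight if the whole file were weights."""
    return size_gb * BYTES_PER_GB * 8 / params


def size_ratio() -> float:
    """Q4_K_M file size as a fraction of the Q8_0 file size."""
    return FILE_SIZE_GB["Q4_K_M"] / FILE_SIZE_GB["Q8_0"]


if __name__ == "__main__":
    for name, size in FILE_SIZE_GB.items():
        print(f"{name}: ~{implied_bits_per_weight(size):.1f} bits per weight")
    print(f"Q4_K_M is ~{size_ratio():.0%} of the Q8_0 file size")
```

The implied figures come out a little above 4 and 8 bits per weight, which is the expected direction: both formats store scaling data alongside the quantized weights.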
`ollama pull llama3:8b-q4_K_M` and `ollama pull llama3:8b-q8_0` are separate downloads. To switch between them, pass the tag you want to `ollama run`.
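If you switch between the two builds often, a small wrapper around the Ollama CLI keeps the tags in one place. The sketch below is illustrative: the helper names `ollama_command` and `pull` are hypothetical, and the tags are copied from this article, so check them against the Ollama model library before relying on them.

```python
import shutil
import subprocess

# Tags as written in this article; verify them against the Ollama library.
TAGS = {
    "Q4_K_M": "llama3:8b-q4_K_M",
    "Q8_0": "llama3:8b-q8_0",
}


def ollama_command(action: str, quant: str) -> list[str]:
    """Build the argument list for `ollama pull` or `ollama run`."""
    if action not in ("pull", "run"):
        raise ValueError(f"unsupported action: {action}")
    return ["ollama", action, TAGS[quant]]


def pull(quant: str) -> None:
    """Download one quantization; each tag is a separate download."""
    if shutil.which("ollama") is None:
        raise RuntimeError("ollama CLI not found on PATH")
    subprocess.run(ollama_command("pull", quant), check=True)


if __name__ == "__main__":
    # Only print the commands here; call pull("Q4_K_M") to actually download.
    for quant in TAGS:
        print(" ".join(ollama_command("run", quant)))
```

Building the command as a list avoids shell quoting issues and makes the tag switch a one-word change.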