Quick Answer
Q4_K_M is the sweet spot: 7B/8B models at Q4_K_M use 4.7 to 4.9 GB, leaving at least 1.1 GB for the KV-cache. Q5_K_M fits but requires limiting context to 2k tokens. Avoid Q6_K and above on 6 GB cards.
Updated: 2026-05
Key Takeaways
Quantization level directly controls how much VRAM a model occupies. For 7B and 8B parameter models (the largest class that fits a 6 GB GPU), the practical options are Q3_K_M through Q5_K_M. Q2_K fits but degrades quality below useful levels; Q6_K and above exceed the 6 GB ceiling.
Q4_K_M is the recommended default: a 7B model uses approximately 4.7 GB and an 8B model approximately 4.9 GB at this quantization. On a 6 GB card this leaves 1.3 GB and 1.1 GB respectively for the KV-cache, which Ollama allocates for the context window. At the default 2048-token context, this is sufficient. Increasing context to 4096 tokens requires approximately 0.5 GB of additional KV-cache on a 7B model, which is still within budget on most 6 GB cards.
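The headroom arithmetic can be checked with a few lines of Python. This is a minimal sketch that only uses the approximate figures quoted in this article; the constant and function names are illustrative, and applying the 0.5 GB figure (quoted for a 7B model) to the 8B model is an assumption.

```python
# Approximate figures taken from this article; real usage varies by model and driver.
CARD_VRAM_GB = 6.0
Q4_K_M_WEIGHTS_GB = {"7B": 4.7, "8B": 4.9}
# Quoted for a 7B model going from 2048 to 4096 tokens; reused for 8B as an assumption.
EXTRA_KV_4096_GB = 0.5


def headroom_gb(weights_gb: float, card_gb: float = CARD_VRAM_GB) -> float:
    """Return the VRAM left over once the model weights are loaded."""
    return round(card_gb - weights_gb, 2)


if __name__ == "__main__":
    for size, weights in Q4_K_M_WEIGHTS_GB.items():
        free = headroom_gb(weights)
        after_bump = round(free - EXTRA_KV_4096_GB, 2)
        print(f"{size}: {free} GB free, {after_bump} GB left after the 4096-token bump")
```

Whatever remains after the bump still has to cover the base 2048-token cache and any runtime overhead, so treat a small positive number as "tight" rather than "safe".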
Q5_K_M is the next step up. An 8B model at Q5_K_M uses approximately 5.7 GB, leaving only 300 MB free. This is enough for very short contexts (512 to 2048 tokens) but will cause OOM errors with longer conversations or system prompts. Use Q5_K_M only if you keep `num_ctx` at 2048 or below.
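One way to pin `num_ctx` is to pass it in the `options` field of a request to Ollama's local REST API. The sketch below assumes Ollama is listening on its default address (`http://localhost:11434`); the helper names `build_request` and `generate` are illustrative, and the model tag you pass should be whichever one you have actually pulled.

```python
import json
import urllib.request

# Default local endpoint for Ollama; adjust if your install listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, num_ctx: int = 2048) -> dict:
    """Build a generate request body that caps the context window at num_ctx."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON response instead of a stream
        "options": {"num_ctx": num_ctx},
    }


def generate(model: str, prompt: str, num_ctx: int = 2048) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    body = json.dumps(build_request(model, prompt, num_ctx)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


# Example call (requires a running Ollama server and a pulled model):
# print(generate("qwen2.5:7b", "Say hello in one sentence.", num_ctx=2048))
```

The same cap can be made permanent with a `PARAMETER num_ctx 2048` line in a Modelfile, which avoids having to repeat it on every request.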
| Quantization | 7B VRAM | 8B VRAM | Fits 6 GB? | Max Context (approx) |
|---|---|---|---|---|
| Q2_K | ~2.8 GB | ~3.0 GB | Yes (quality poor) | 8k+ |
| Q3_K_M | ~3.5 GB | ~3.7 GB | Yes (acceptable) | 8k+ |
| Q4_K_M | ~4.7 GB | ~4.9 GB | Yes (recommended) | 4k |
| Q5_K_M | ~5.5 GB | ~5.7 GB | Yes, tight (2k ctx only) | 2k |
| Q6_K | ~6.4 GB | ~6.6 GB | No (OOM) | n/a |
| Q8_0 | ~7.5 GB | ~7.7 GB | No (OOM) | n/a |
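The table can also be read as a lookup: given the context length you need, pick the highest-precision quantization that still fits. The following is a rough sketch built only from the table's approximate numbers; `best_quant` is a hypothetical helper, and treating "8k+" as exactly 8192 tokens is a conservative assumption.

```python
from typing import Optional

# (quantization, 7B GB, 8B GB, approximate max context in tokens), lowest precision first.
# Q6_K and Q8_0 are left out because the table marks them as OOM on a 6 GB card.
# "8k+" is encoded as 8192, a conservative lower bound (assumption).
TABLE = [
    ("Q2_K", 2.8, 3.0, 8192),
    ("Q3_K_M", 3.5, 3.7, 8192),
    ("Q4_K_M", 4.7, 4.9, 4096),
    ("Q5_K_M", 5.5, 5.7, 2048),
]


def best_quant(needed_ctx: int) -> Optional[str]:
    """Return the highest-precision quantization whose max context covers needed_ctx."""
    for name, _gb_7b, _gb_8b, max_ctx in reversed(TABLE):
        if max_ctx >= needed_ctx:
            return name
    return None  # nothing in the table is known to cover this context length


if __name__ == "__main__":
    for ctx in (2048, 4096, 8192):
        print(ctx, "->", best_quant(ctx))
```

A `None` result only means the table gives no guarantee at that length, not that the model will necessarily fail to load.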
Three 7B/8B models stand out at Q4_K_M on a 6 GB card. Qwen 2.5 7B Instruct is the best all-rounder: strong coding (HumanEval ~60%), multilingual support, and a 128k context architecture (though you will run at 4k due to VRAM). Run it with `ollama run qwen2.5:7b`.
Llama 3.1 8B is the fastest option. At Q4_K_M it runs at approximately 25 tokens per second on an RTX 3060 6 GB and handles general chat and instruction-following reliably. Its MMLU score of 66.6% is lower than that of Qwen 2.5 7B, but the speed advantage makes it the better pick for interactive sessions.
Phi-4 Mini (3.8B) is the wild card. At Q8_0 it fits in approximately 4.1 GB, comfortably within 6 GB, and punches above its weight on reasoning benchmarks relative to its size. Use it when you need a sub-5 GB footprint with better reasoning than older 7B models. Run it with `ollama run phi4-mini`.
Do not attempt 14B models on 6 GB. Qwen 2.5 14B at Q4_K_M requires 9.3 GB. Q2_K brings it to approximately 5.5 GB, but the perplexity penalty is severe: the model produces noticeably degraded output. Stick to 7B/8B at Q4_K_M or 3B/4B at Q8_0.