Quick Answer
With 12 GB VRAM, the best general model is Llama 3 8B at Q5_K_M. For coding, use Qwen 2.5 Coder 14B at Q4_K_M. Both run at 20–30 tokens per second.
Updated: 2026-05
Key Takeaways
As of May 2026, the RTX 3060 12 GB is the cheapest path to running 14B models locally. Its 12 GB of VRAM matches the RTX 4070 Ti (~$800) at a fraction of the cost; the RTX 4080 (~$1,100) carries 16 GB. For a $280–$350 used card, you get the same model capacity as a 12 GB card costing roughly 2–3× more, limited only by raw speed, not by what you can load.
All five models below run with Ollama out of the box. Speed figures are at default 2048-token context on a desktop PC with no CPU offload.
| Model | VRAM Used | Speed |
|---|---|---|
| Llama 3 8B Q5_K_M | 7.0 GB | ~25 tok/s |
| Qwen 2.5 Coder 14B Q4_K_M | 10.0 GB | ~20 tok/s |
| Mistral 7B Q6_K | 6.5 GB | ~27 tok/s |
| Phi-4 Q5_K_M | 6.2 GB | ~28 tok/s |
| Qwen 14B Q4_K_M | 10.0 GB | ~18 tok/s |
For the general-use pick, run Llama 3 8B at Q5_K_M with a 4096-token context window. This uses ~8 GB VRAM total and leaves 4 GB of headroom, enough to avoid VRAM overflow when switching between models.
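The headroom arithmetic can be sketched in a few lines of Python. This is an illustrative budget check, not a measurement tool: the per-model figures are copied from the table above (2048-token context), the 2 GB reserve is an assumed safety margin, and the function and variable names are hypothetical.

```python
# VRAM budget sketch for a 12 GB card.
# Figures come from the table above; the 2 GB reserve is an assumed margin.
TOTAL_VRAM_GB = 12.0

MODELS_GB = {
    "Llama 3 8B Q5_K_M": 7.0,
    "Qwen 2.5 Coder 14B Q4_K_M": 10.0,
    "Mistral 7B Q6_K": 6.5,
    "Phi-4 Q5_K_M": 6.2,
    "Qwen 14B Q4_K_M": 10.0,
}


def headroom_gb(used_gb: float, total_gb: float = TOTAL_VRAM_GB) -> float:
    """Return the VRAM left over once `used_gb` is allocated."""
    return total_gb - used_gb


def fits_together(names, reserve_gb: float = 2.0,
                  total_gb: float = TOTAL_VRAM_GB) -> bool:
    """True if all named models can stay resident with the reserve kept free."""
    used = sum(MODELS_GB[name] for name in names)
    return headroom_gb(used, total_gb) >= reserve_gb


# Print the headroom each model leaves when loaded on its own.
for name, used in MODELS_GB.items():
    print(f"{name}: {headroom_gb(used):.1f} GB free")
```

With these numbers every model fits on its own, but no pair does: `fits_together(["Llama 3 8B Q5_K_M", "Qwen 2.5 Coder 14B Q4_K_M"])` returns `False` because the two together need 17 GB.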
For coding, Qwen 2.5 Coder 14B at Q4_K_M is the clear choice: it outperforms Llama 3 8B on HumanEval, fits in 10 GB VRAM, and handles Python, TypeScript, and Go without fine-tuning.
Leave at least 1.5–2 GB of VRAM free at all times. Loading two models back-to-back without unloading the first triggers VRAM overflow and forces slow CPU offload.
For the full GPU benchmark context, see the best GPUs for local LLMs. If your GPU has less than 12 GB, see the best models for 6 GB VRAM.
To run the top general-purpose pick on your RTX 3060:
ollama pull llama3:8b-instruct-q5_K_M
ollama run llama3:8b-instruct-q5_K_M
If you need a larger context window, set num_ctx to 4096, for example with /set parameter num_ctx 4096 inside the interactive session. Run ollama run modelname and it loads entirely to GPU if VRAM is sufficient.
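The context window can also be set per request through Ollama's local REST API, which is convenient when calling the model from a script. The sketch below assumes an Ollama server running on its default address (`http://localhost:11434`) and uses the `options.num_ctx` field of the `/api/generate` endpoint; the helper names `build_request` and `generate` are hypothetical.

```python
import json
import urllib.request

# Assumption: a local Ollama server listening on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str,
                  model: str = "llama3:8b-instruct-q5_K_M",
                  num_ctx: int = 4096) -> dict:
    """Build the JSON body for /api/generate with a custom context window."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON object, not a stream
        "options": {"num_ctx": num_ctx},
    }


def generate(prompt: str, **kwargs) -> str:
    """Send the request to the local server and return the generated text."""
    body = json.dumps(build_request(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Explain KV cache in one sentence.")` with the server running should return the model's reply as a string. A larger `num_ctx` increases VRAM use, so keep the headroom guidance in mind when raising it.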