Quick Answer
Qwen 2.5 Coder 14B Q4_K_M is the best coding model for 12 GB VRAM GPUs like the RTX 3060. It uses ~10 GB VRAM and scores highest on HumanEval among models that fit this constraint. DeepSeek Coder 14B is a strong alternative.
Updated: 2026-05
Key Takeaways
As of May 2026, a 14B coding model at Q4_K_M uses 9β10 GB VRAM β making 12 GB the minimum tier that reliably fits the highest-scoring coding models. 8 GB cards cap at 7B models, which score noticeably lower on HumanEval than their 14B counterparts.
Qwen 2.5 Coder 14B Q4_K_M is the top pick β it leads consistently on Python and TypeScript tasks. DeepSeek Coder 14B is a close alternative for polyglot work across 80+ languages. Both use ~9β10 GB VRAM and run at ~14 tok/s on an RTX 3060 12 GB. The RTX 3080 Ti 12 GB pushes these to ~18 tok/s thanks to its higher memory bandwidth (912 GB/s vs 360 GB/s).
If you are working with an 8 GB rig and need a 14B model without upgrading, see the best LLMs for AMD 5700X + RTX 3070 Ti for the 8 GB compromise options.
| Model | VRAM at Q4 | Best For |
|---|---|---|
| Qwen 2.5 Coder 14B Q4_K_M | ~9β10 GB | Python, TypeScript, Go (top pick) |
| DeepSeek Coder 14B Q4_K_M | ~9β10 GB | 80+ languages, polyglot work |
| StarCoder2 15B Q4 | ~9.5 GB | Open-source contribution, code search |
| Llama 3 8B Q5_K_M | ~6 GB | Lighter fallback if 14B feels slow |
Set context to 8k minimum for coding work β the default 2048-token context truncates most source files above ~200 lines. A 14B model at Q4_K_M uses approximately 11.5 GB VRAM at 8k context, which still fits within 12 GB. Use --num-ctx 8192 or set OLLAMA_NUM_CTX=8192 in your environment.
Enable Flash Attention (OLLAMA_FLASH_ATTENTION=1) to reduce the KV cache VRAM footprint by roughly 30%, giving headroom for even longer context at the same 12 GB budget. Both environment variables can be combined in a single launch.
For a full breakdown of which 12 GB GPUs deliver the best coding inference and which models to pair with each, see the best local LLMs for coding guide.
ollama pull qwen2.5-coder:14b-instruct-q4_K_M
ollama run qwen2.5-coder:14b-instruct-q4_K_M/api/generate endpoint using the suffix field. Both run FIM within the normal VRAM budget on 12 GB cards.