PromptQuorum

Best Local LLM for Coding with 12 GB VRAM?

Quick Answer

Qwen 2.5 Coder 14B Q4_K_M is the best coding model for 12 GB VRAM GPUs like the RTX 3060. It uses ~10 GB VRAM and scores highest on HumanEval among models that fit this constraint. DeepSeek Coder 14B is a strong alternative.

  • Qwen 2.5 Coder 14B Q4_K_M: ~10 GB VRAM, top coding benchmark for this size
  • DeepSeek Coder 14B Q4_K_M: similar VRAM, competitive on code completion
  • Both fit RTX 3060 12 GB and RTX 3080 Ti 12 GB

Updated: 2026-05

Hardware-Specific · Intermediate

Key Takeaways

  • Qwen 2.5 Coder 14B Q4_K_M: ~9–10 GB VRAM, high-70s on HumanEval, the top coding pick for any 12 GB GPU
  • 12 GB is the VRAM threshold that unlocks the 14B coding tier; 8 GB cards cap at 7B models, which score noticeably lower on coding benchmarks
  • Set the context window (num_ctx) to at least 8192 for coding work; the default 2048-token context truncates most real source files
  • NVIDIA 12 GB cards deliver ~22 tok/s on these models; AMD 12 GB cards with ROCm reach ~16 tok/s

The Coding Pick for 12 GB

As of May 2026, a 14B coding model at Q4_K_M uses 9–10 GB VRAM, making 12 GB the minimum tier that reliably fits the highest-scoring coding models. 8 GB cards cap at 7B models, which score noticeably lower on HumanEval than their 14B counterparts.
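
The 9–10 GB figure can be sanity-checked with a rough weight-size estimate. This is a minimal sketch, not a measurement: the parameter counts and the effective bits per weight for each quantization are approximations assumed here, not values from this guide, and runtime buffers add to the result.

```python
# Rough weight-memory estimate for a quantized model.
# ASSUMPTIONS (not from the article): ~14.8e9 parameters for a "14B" model,
# ~8.0e9 for an "8B" model, and approximate effective bits per weight of
# ~4.85 for Q4_K_M and ~5.7 for Q5_K_M.

def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

q4_14b = weight_gb(14.8e9, 4.85)  # 14B model at Q4_K_M
q5_8b = weight_gb(8.0e9, 5.7)     # 8B model at Q5_K_M

print(f"14B at Q4_K_M: ~{q4_14b:.1f} GB of weights")
print(f" 8B at Q5_K_M: ~{q5_8b:.1f} GB of weights")
```

The KV cache and runtime buffers sit on top of the weights, which is why a model whose weights come to about 9 GB still needs a 12 GB card once context grows.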

Qwen 2.5 Coder 14B Q4_K_M is the top pick: it leads consistently on Python and TypeScript tasks. DeepSeek Coder 14B is a close alternative for polyglot work across 80+ languages. Both use ~9–10 GB VRAM and run at ~14 tok/s on an RTX 3060 12 GB. The RTX 3080 Ti 12 GB pushes these to ~18 tok/s thanks to its higher memory bandwidth (912 GB/s vs 360 GB/s).
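
To see why memory bandwidth matters, a common back-of-envelope model treats single-stream decoding as bandwidth-bound: each generated token reads the full weights once. The sketch below computes that theoretical ceiling only. The 9.5 GB weight size is the midpoint of the range above, and the function name is hypothetical.

```python
# Theoretical decode ceiling under a pure bandwidth-bound assumption:
# every generated token requires one full read of the model weights.
# This is an upper bound, not a prediction of measured throughput.

def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/second if weight reads were the only cost."""
    return bandwidth_gb_s / weights_gb

WEIGHTS_GB = 9.5  # assumed midpoint of the ~9-10 GB range

for name, bw in [("RTX 3060 12 GB", 360.0), ("RTX 3080 Ti 12 GB", 912.0)]:
    ceiling = decode_ceiling_tok_s(bw, WEIGHTS_GB)
    print(f"{name}: ceiling ~{ceiling:.0f} tok/s")
```

Measured throughput normally lands well below this bound, since compute, KV-cache reads, and runtime overhead also cost time, so treat it as a way to compare cards rather than a forecast.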

If you are working with an 8 GB rig and need a 14B model without upgrading, see the best LLMs for AMD 5700X + RTX 3070 Ti for the 8 GB compromise options.

Model | VRAM at Q4 | Best For
Qwen 2.5 Coder 14B Q4_K_M | ~9–10 GB | Python, TypeScript, Go (top pick)
DeepSeek Coder 14B Q4_K_M | ~9–10 GB | 80+ languages, polyglot work
StarCoder2 15B Q4 | ~9.5 GB | Open-source contribution, code search
Llama 3 8B Q5_K_M | ~6 GB | Lighter fallback if 14B feels slow

Settings That Matter for Coding

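Set context to at least 8k tokens for coding work: the default 2048-token context truncates most source files above ~200 lines. A 14B model at Q4_K_M uses approximately 11.5 GB VRAM at 8k context, which still fits within 12 GB. In Ollama the setting is num_ctx: use /set parameter num_ctx 8192 inside an ollama run session, add PARAMETER num_ctx 8192 to a Modelfile, or set OLLAMA_CONTEXT_LENGTH=8192 in your environment on recent releases.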
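
The jump in VRAM at 8k context comes mostly from the KV cache, which grows linearly with context length. The sketch below estimates it for a 14B-class model. The layer count, KV-head count, and head dimension are assumed values for a Qwen2.5-14B-style architecture with grouped-query attention, and the cache is assumed to be stored in 16-bit precision; check the model card before relying on them.

```python
# KV-cache size estimate. ASSUMPTIONS (not from the article): 48 layers,
# 8 key/value heads, head dimension 128, and 2 bytes per element (fp16).

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_ctx: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size in GiB; the factor 2 covers keys and values."""
    total = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem
    return total / 2**30

for n_ctx in (2048, 8192, 16384):
    size = kv_cache_gib(48, 8, 128, n_ctx)
    print(f"num_ctx={n_ctx:>5}: KV cache ~{size:.2f} GiB")
```

Adding the 8k result to 9–10 GB of weights lands close to the ~11.5 GB figure quoted above, with the small remainder going to runtime buffers.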

Enable Flash Attention (OLLAMA_FLASH_ATTENTION=1) to reduce the KV cache VRAM footprint by roughly 30%, giving headroom for even longer context at the same 12 GB budget. Both environment variables can be combined in a single launch.
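
To get a feel for how much extra context a smaller KV cache buys, the calculation can be inverted: given a VRAM budget, how many tokens of cache fit? This is a rough sketch under stated assumptions. The per-token cache size (196,608 bytes, from an assumed 48-layer, 8-KV-head, 128-dim fp16 layout), the 0.5 GB overhead, and the helper name are all hypothetical. The 0.7 scale simply applies the "roughly 30%" reduction quoted above.

```python
# How many context tokens fit in the VRAM left after weights and overhead?
# ASSUMPTIONS: per-token KV size and overhead are illustrative values only.

def max_context_tokens(vram_gb: float, weights_gb: float, overhead_gb: float,
                       kv_bytes_per_token: float, kv_scale: float = 1.0) -> int:
    """Largest token count whose KV cache fits in the remaining VRAM."""
    free_bytes = (vram_gb - weights_gb - overhead_gb) * 1e9
    if free_bytes <= 0:
        return 0
    return int(free_bytes // (kv_bytes_per_token * kv_scale))

KV_BYTES_PER_TOKEN = 196_608  # assumed: 2 * 48 layers * 8 heads * 128 dim * 2 bytes

baseline = max_context_tokens(12.0, 9.5, 0.5, KV_BYTES_PER_TOKEN)
reduced = max_context_tokens(12.0, 9.5, 0.5, KV_BYTES_PER_TOKEN, kv_scale=0.7)

print(f"baseline KV cache: ~{baseline} tokens fit")
print(f"with ~30% smaller KV cache: ~{reduced} tokens fit")
```

Under these assumptions an 8k context fits either way, and the reduced cache leaves room for a noticeably longer window on the same card.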

For a full breakdown of which 12 GB GPUs deliver the best coding inference and which models to pair with each, see the best local LLMs for coding guide.

ollama pull qwen2.5-coder:14b-instruct-q4_K_M
ollama run qwen2.5-coder:14b-instruct-q4_K_M
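
If you drive the model from an editor plugin or script rather than the interactive prompt, the context size can also be passed per request through Ollama's REST API using the options.num_ctx field. A minimal sketch, assuming a local server on the default port; the helper names are hypothetical.

```python
import json
import urllib.request

# Assumes an Ollama server running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str,
                  model: str = "qwen2.5-coder:14b-instruct-q4_K_M",
                  num_ctx: int = 8192) -> dict:
    """Build a non-streaming generate payload with an explicit context size."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }

def generate(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text (needs a running server)."""
    body = json.dumps(build_request(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

payload = build_request("Write a Python function that reverses a string.")
print(json.dumps(payload, indent=2))
```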

AMD Radeon 12 GB cards (RX 6700 XT, RX 6750 XT) with ROCm run these models at ~16 tok/s, roughly 30% slower than CUDA on an equivalent NVIDIA 12 GB card. For mobile AMD GPUs (e.g., Radeon 6800M), see the dedicated mobile guide for thermal and battery considerations.
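
Throughput varies with drivers, context length, and quantization, so it is worth measuring on your own card. Ollama's non-streaming API responses include eval_count (tokens generated) and eval_duration (nanoseconds), from which tokens per second follows directly. The sample values below are hypothetical, chosen only to illustrate the arithmetic.

```python
# Compute decode throughput from the timing fields in an Ollama response.

def tokens_per_second(response: dict) -> float:
    """eval_count is a token count; eval_duration is in nanoseconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)

# Hypothetical response fields, not a real measurement.
sample = {"eval_count": 320, "eval_duration": 20_000_000_000}

print(f"{tokens_per_second(sample):.1f} tok/s")
```

Running the same prompt a few times and averaging gives a steadier number, since the first request also pays for model load.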

Quick Answers About Coding LLMs for 12 GB VRAM

Does Qwen 2.5 Coder fit on RTX 4060 Ti 8 GB?
No. Qwen 2.5 Coder 14B at Q4_K_M uses ~9–10 GB, which exceeds 8 GB VRAM. Dropping to Q3_K_M (~7 GB) allows it to fit, but output quality degrades noticeably on code completion tasks.

Should I use Q4_K_M or Q5_K_M for coding on 12 GB?
Use Q4_K_M for 14B models; it is required to stay within 12 GB. Use Q5_K_M for 7B–8B models, where you have extra VRAM headroom; it preserves more model fidelity with no VRAM risk on 12 GB cards.

Which is better for code review: 14B coding model or 8B general-purpose?
The 14B coding-specific model, by a decisive margin. Coding-tuned 14B models score substantially higher on HumanEval than general-purpose models of similar size, reflecting coding-specific pretraining data rather than just parameter count.

Can I use fill-in-the-middle (FIM) with these models on Ollama?
Yes. Qwen 2.5 Coder and DeepSeek Coder both support FIM natively. Ollama exposes it via the /api/generate endpoint using the suffix field. Both run FIM within the normal VRAM budget on 12 GB cards.
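
A FIM request sends the code before the cursor as prompt and the code after it as suffix; the model returns the middle. The sketch below only builds the payload. The helper name and the num_predict and temperature choices are illustrative, and whether a particular model tag's template accepts suffix is an assumption to verify on your install.

```python
import json

def build_fim_request(prefix: str, suffix: str,
                      model: str = "qwen2.5-coder:14b-instruct-q4_K_M",
                      num_predict: int = 128) -> dict:
    """Payload for POST /api/generate: the model fills the gap between prefix and suffix."""
    return {
        "model": model,
        "prompt": prefix,   # code before the cursor
        "suffix": suffix,   # code after the cursor
        "stream": False,
        "options": {"num_predict": num_predict, "temperature": 0},
    }

fim = build_fim_request(
    prefix="def mean(values):\n    total = ",
    suffix="\n    return total / len(values)\n",
)
print(json.dumps(fim, indent=2))
```

POST this JSON to http://localhost:11434/api/generate on a running server; the completion comes back in the response field.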