Best Local LLM for Coding with 12 GB VRAM?

Quick Answer

Qwen 3 Coder 14B Q4_K_M is the best coding model for 12 GB VRAM GPUs such as the RTX 3060 and RTX 3080 Ti: it uses ~10 GB VRAM and scores highest on HumanEval among the 14B models that fit this budget. DeepSeek Coder 14B is a strong alternative.

▸Qwen 3 Coder 14B Q4_K_M: ~10 GB VRAM, top coding benchmark for this size
▸DeepSeek Coder 14B Q4_K_M: similar VRAM, competitive on code completion
▸Both fit RTX 3060 12 GB and RTX 3080 Ti 12 GB

Updated: 2026-05

Hardware-Specific · Intermediate

Key Takeaways

✓Qwen 3 Coder 14B Q4_K_M: ~9–10 GB VRAM, high-70s on HumanEval — the top coding pick for any 12 GB GPU
✓12 GB is the VRAM threshold that unlocks the 14B coding tier; 8 GB cards cap at 7B models, which score noticeably lower on coding benchmarks
✓Set num_ctx to at least 8192 for coding work; the default 2048-token context truncates most real source files
✓NVIDIA 12 GB cards deliver ~22 tok/s on these models; AMD 12 GB cards with ROCm run at ~16 tok/s

The Coding Pick for 12 GB

As of May 2026, a 14B coding model at Q4_K_M uses 9–10 GB VRAM — making 12 GB the minimum tier that reliably fits the highest-scoring coding models. 8 GB cards cap at 7B models, which score noticeably lower on HumanEval than their 14B counterparts.
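The 9–10 GB figure can be sanity-checked with a back-of-the-envelope estimate of the quantized weights alone. This is a minimal sketch, not a measurement: the bits-per-weight value for Q4_K_M (about 4.85) is an assumption based on typical GGUF file sizes, and the function name is hypothetical.

```python
def weight_vram_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the quantized weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: Q4_K_M averages roughly 4.85 bits per weight.
# The real figure varies slightly by model architecture.
Q4_K_M_BPW = 4.85

for label, params in [("7B", 7e9), ("14B", 14e9)]:
    gb = weight_vram_gb(params, Q4_K_M_BPW)
    print(f"{label}: ~{gb:.1f} GB of weights at Q4_K_M")
```

Under these assumptions a 14B model needs about 8.5 GB for weights, so the KV cache and runtime buffers would account for the remainder of the 9–10 GB total. A 7B model lands near 4.2 GB, which is why it fits comfortably on 8 GB cards.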

Qwen 3 Coder 14B Q4_K_M is the top pick — it leads consistently on Python and TypeScript tasks. DeepSeek Coder 14B is a close alternative for polyglot work across 80+ languages. Both use ~9–10 GB VRAM and run at ~14 tok/s on an RTX 3060 12 GB. The RTX 3080 Ti 12 GB pushes these to ~18 tok/s thanks to its higher memory bandwidth (912 GB/s vs 360 GB/s).

If you are working with an 8 GB rig and need a 14B model without upgrading, see the best LLMs for AMD 5700X + RTX 3070 Ti for the 8 GB compromise options.

Model (quantization)	VRAM	Best For
Qwen 3 Coder 14B Q4_K_M	~9–10 GB	Python, TypeScript, Go (top pick)
DeepSeek Coder 14B Q4_K_M	~9–10 GB	80+ languages, polyglot work
StarCoder2 15B Q4	~9.5 GB	Open-source contribution, code search
Llama 3 8B Q5_K_M	~6 GB	Lighter fallback if 14B feels slow

Settings That Matter for Coding

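Set the context to at least 8k for coding work: the default 2048-token context truncates most source files above ~200 lines. A 14B model at Q4_K_M uses approximately 11.5 GB VRAM at 8k context, which still fits within 12 GB. In Ollama, set num_ctx to 8192 (for example with /set parameter num_ctx 8192 inside an ollama run session, or PARAMETER num_ctx 8192 in a Modelfile), or set OLLAMA_CONTEXT_LENGTH=8192 in the server's environment.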
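To see why a longer context costs VRAM, the KV cache can be estimated from the model's attention shape. The sketch below is illustrative only: the layer count, KV-head count, and head dimension are assumed values for a 14B-class model with grouped-query attention, not figures from this guide, and an fp16 cache (2 bytes per element) is assumed.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                n_ctx: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GB (10^9 bytes) for a given context length."""
    # The leading 2 covers one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 1e9


# Assumed architecture for a 14B-class model; check the model card for real values.
ASSUMED_SHAPE = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for ctx in (2048, 8192, 16384):
    print(f"num_ctx={ctx}: ~{kv_cache_gb(n_ctx=ctx, **ASSUMED_SHAPE):.2f} GB KV cache")
```

With these assumed values, going from 2048 to 8192 tokens grows the cache from roughly 0.4 GB to roughly 1.6 GB, which is in line with a 14B model moving from 9–10 GB toward the 11.5 GB figure quoted above. The cache scales linearly with context length, so 16k would need about twice the 8k amount.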

Enable Flash Attention (OLLAMA_FLASH_ATTENTION=1) to reduce the KV cache VRAM footprint by roughly 30%, giving headroom for even longer context at the same 12 GB budget. Both environment variables can be combined in a single launch.
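Combining the two settings in one launch could look like the sketch below, which starts the Ollama server from Python with both variables set. It assumes the `ollama` binary is on PATH and uses OLLAMA_CONTEXT_LENGTH as the context-length variable; that name is taken from recent Ollama releases, so verify it against your installed version. The helper names are hypothetical.

```python
import os
import subprocess


def ollama_env(context_length: int = 8192, flash_attention: bool = True) -> dict:
    """Return a copy of the current environment with the Ollama settings applied."""
    env = dict(os.environ)
    # Assumption: the server reads its default context length from this variable.
    env["OLLAMA_CONTEXT_LENGTH"] = str(context_length)
    env["OLLAMA_FLASH_ATTENTION"] = "1" if flash_attention else "0"
    return env


def start_server() -> subprocess.Popen:
    """Launch `ollama serve` with the settings above (requires ollama on PATH)."""
    return subprocess.Popen(["ollama", "serve"], env=ollama_env())


# Show the settings without starting anything.
settings = ollama_env()
for name in ("OLLAMA_CONTEXT_LENGTH", "OLLAMA_FLASH_ATTENTION"):
    print(f"{name}={settings[name]}")
```

Calling `start_server()` replaces a manual shell launch. Keeping the environment construction in its own function makes the settings easy to inspect before any process is started.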

For a full breakdown of which 12 GB GPUs deliver the best coding inference and which models to pair with each, see the best local LLMs for coding guide.

ollama pull qwen2.5-coder:14b-instruct-q4_K_M
ollama run qwen2.5-coder:14b-instruct-q4_K_M

AMD Radeon 12 GB cards (RX 6700 XT, RX 6750 XT) with ROCm run these models at ~16 tok/s, roughly 30% slower than CUDA on an equivalent NVIDIA 12 GB card. For mobile AMD GPUs (e.g., Radeon RX 6800M), see the dedicated mobile guide for thermal and battery considerations.

Related Guides

▸Best MoE Models for Local Coding -- MoE coding models
▸Cursor Pro vs Continue.dev: Which AI Coding Tool? -- coding tool comparison

Quick Answers About Coding LLMs for 12 GB VRAM

Does Qwen 3 Coder fit on RTX 4060 Ti 8 GB?

No — Qwen 3 Coder 14B at Q4_K_M uses ~9–10 GB, which exceeds 8 GB VRAM. Dropping to Q3_K_M (~7 GB) allows it to fit, but output quality degrades noticeably on code completion tasks.

Should I use Q4_K_M or Q5_K_M for coding on 12 GB?

Q4_K_M for 14B models — required to stay within 12 GB. Q5_K_M for 7B–8B models where you have extra VRAM headroom; it preserves more model fidelity with no VRAM risk on 12 GB cards.

Which is better for code review: 14B coding model or 8B general-purpose?

The 14B coding-specific model — by a decisive margin. Coding-tuned 14B models score substantially higher on HumanEval than general-purpose models of similar size, reflecting coding-specific pretraining data rather than just parameter count.

Can I use fill-in-the-middle (FIM) with these models on Ollama?

Yes. Qwen 3 Coder and DeepSeek Coder both support FIM natively. Ollama exposes it via the /api/generate endpoint using the suffix field. Both run FIM within the normal VRAM budget on 12 GB cards.
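A FIM request through /api/generate might look like the following minimal sketch. It assumes an Ollama server on the default local port and uses the model tag from the pull command in this guide. The helper names are hypothetical, and whether a given tag accepts the suffix field depends on that model's template.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama address


def build_fim_payload(prefix: str, suffix: str,
                      model: str = "qwen2.5-coder:14b-instruct-q4_K_M",
                      num_ctx: int = 8192) -> dict:
    """Build a generate request asking the model to fill in code between prefix and suffix."""
    return {
        "model": model,
        "prompt": prefix,   # code before the gap
        "suffix": suffix,   # code after the gap
        "stream": False,    # return one JSON object instead of a token stream
        "options": {"num_ctx": num_ctx, "temperature": 0},
    }


def complete_middle(prefix: str, suffix: str) -> str:
    """Send the FIM request to a running Ollama server and return the generated middle."""
    data = json.dumps(build_fim_payload(prefix, suffix)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


# Inspect the request body without contacting a server.
payload = build_fim_payload("def add(a, b):\n    ", "\n    return result\n")
print(json.dumps(payload, indent=2))
```

With a server running, `complete_middle(prefix, suffix)` returns only the text for the gap, which an editor plugin would splice between the two halves.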
