Key Takeaways
- Qwen 3.6 27B leads: 92.1% HumanEval, 77.2% SWE-bench, 84.3% MBPP, the highest across all three benchmarks for a local model.
- DeepSeek Coder is the cloud cost winner: $0.14/1M input tokens, 0.5 pp below Qwen on HumanEval. Use it for non-sensitive public code at scale.
- Mistral Devstral excels at agentic tasks: better at multi-step tool use and multi-file refactoring than pure benchmark scores suggest.
- Latency: Qwen 3.6 27B at Q4_K_M runs at ~35 tokens/sec on an RTX 4090. Devstral (14 GB footprint) runs at ~40 tokens/sec. DeepSeek Coder API latency is network-dependent (roughly 150–400 ms to first token from the EU).
- Dispatch strategy: route sensitive/GDPR code tasks to local Qwen 3.6, high-volume non-sensitive tasks to DeepSeek Coder API, agentic refactoring to local Devstral.
Why Local Coding Models Caught Up
For the first three years of the LLM era, cloud models led local models on every coding benchmark by 10–20 percentage points. That gap closed in 2025–2026 as open-weight models scaled into the 27–72B parameter range with coding-specific training on large code corpora.
Qwen 3.6 27B, released April 2026, achieved 77.2% on SWE-bench, a benchmark that tests whether models can resolve real GitHub issues in open-source codebases. This score compares directly to Claude Sonnet 4.6 (~72%) and GPT-4o (~73%), both significantly larger and cloud-only. The architectural insight is that focused pre-training on filtered code data (Alibaba reports roughly 3T code tokens in the Qwen 3 training mix) compensates for the parameter-count gap.
Three factors drove the convergence: (1) high-quality code training data at scale, (2) RLHF tuned on real software engineering tasks rather than generic instruction following, and (3) improved GGUF quantization that preserves coding ability at Q4 precision better than earlier quantization methods.
In One Sentence
Qwen 3.6 27B scores 77.2% on SWE-bench locally, matching or beating Claude Sonnet 4.6 and GPT-4o on real-world GitHub issue resolution.
In Plain Terms
SWE-bench tests whether an AI can actually fix bugs in real open-source codebases like Django, Flask, and NumPy. A score of 77.2% means the model resolved 77 out of 100 real GitHub issues without human help.
Benchmark Table
All scores are published May 2026 figures from official model pages or open leaderboards. HumanEval uses pass@1 metric. SWE-bench uses verified test pass rate. MBPP uses pass@1 on the full MBPP test set.
| Benchmark | Qwen 3.6 27B | DeepSeek Coder | Mistral Devstral 24B | Codestral 22B |
|---|---|---|---|---|
| HumanEval (Python, pass@1) | 92.1% | 91.6% | 90.1% | 88.9% |
| SWE-bench (GitHub issues) | 77.2% | ~75% | ~73% | N/A |
| MBPP (Python problems) | 84.3% | 82.7% | 81.4% | 79.2% |
| Multi-lang (Java, Go, Rust) | 88.4% | 87.1% | 84.6% | 83.1% |
Note: SWE-bench scores for DeepSeek Coder and Mistral Devstral are estimated from available leaderboard data. Qwen 3.6 27B's SWE-bench score is from the official publication; Codestral has no published SWE-bench result, hence N/A in the table.
Tip: DeepSeek's model lineup evolves frequently. Verify the current model name and pricing at platform.deepseek.com before deployment. Figures reflect publicly available data as of May 2026.
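If you want to reproduce the pass@1 figures above on your own hardware, the standard unbiased pass@k estimator from the original HumanEval paper is only a few lines. The sketch below assumes you already have a harness that runs each generated completion against the benchmark's unit tests; only the estimator itself is shown.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator (Chen et al., 2021):
    # n = completions sampled per problem, c = completions that passed, k = the k in pass@k
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per problem, 17 passed all tests -> pass@1 = 0.85
print(pass_at_k(n=20, c=17, k=1))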
Per-Token Cost Math
The economics of coding LLMs depend on usage volume, task sensitivity, and infrastructure overhead. Below are cost projections at different daily token volumes for a single developer and for a small team. Note: all power costs are calculated at EU electricity rates (€0.35/kWh), standard for Germany and much of Europe as of May 2026.
At 5M tokens/day (a heavy coding workload: autocomplete, test generation, code review), the DeepSeek Coder cloud API costs roughly $0.70/day on input tokens alone, or about $1.17/day once output tokens are blended in at a 1:2 input/output ratio. Over a working year (250 days) that is roughly $175–300 per developer for non-sensitive tasks. An RTX 4090 ($1,500–2,000) running Qwen 3.6 27B locally also has to cover EU electricity, so at single-developer volume it rarely wins on raw cost; the break-even shifts dramatically for teams and for GDPR-sensitive code.
For a team of 10 generating 50M tokens/day, the cloud bill is roughly $7/day on input tokens (about $11–12/day blended, or $1,750–2,900/year). A couple of shared RTX 4090 inference boxes (roughly $3,000–4,000 of hardware for the team) can amortise within a few years at that volume, with full GDPR compliance and zero per-token cost thereafter; the exact break-even depends on how many systems the team runs and how heavily the GPUs are loaded. The calculator below makes the assumptions explicit.
# Cost calculator: per-token math for coding LLMs
# Assumptions: input + output ratio 1:2, so effective blended rate
# Electricity: EU average €0.35/kWh (May 2026)
# DeepSeek Coder (cloud)
input_rate = 0.14 # $/1M tokens (approximate)
output_rate = 0.28 # $/1M tokens (approximate for deepseek-chat)
blended = (input_rate + 2 * output_rate) / 3 # ~$0.23/1M blended
daily_tokens = 5_000_000 # 5M tokens/day per developer
daily_cost = (daily_tokens / 1_000_000) * blended # ~$1.17/day
annual_cost = daily_cost * 250 # ~$292/year per developer
# Qwen 3.6 27B local (RTX 4090)
hardware_cost = 1800 # USD (RTX 4090 GPU)
power_cost = 0.35 * 24 * 365 * 0.35 # 350 W around the clock at €0.35/kWh = €1,073/year (~$1,073)
annual_local = power_cost # ~$1,073/year in electricity after the hardware is paid for
# Break-even vs DeepSeek: hardware_cost / (annual_cost - annual_local)
# At 5M tokens/day with the 24/7 power assumption above, annual_local exceeds annual_cost,
# so there is no cost-only break-even at this volume; rerun with your own duty cycle and volume.

Latency Reality
Latency matters for interactive coding: autocomplete feels broken above 500ms, code review is acceptable up to 3s, batch jobs are latency-insensitive. The figures below are estimates from community benchmarks and internal testing, not official vendor measurements.
| Model | First Token (ms) | Sustained (tok/sec) | Interactive Coding? |
|---|---|---|---|
| Qwen 3.6 27B Q4_K_M (RTX 4090) | 80–120 | ~35 | Yes |
| Qwen 3.6 27B Q4_K_M (Apple M4 Max 48 GB) | 50–80 | ~42 | Yes |
| Mistral Devstral 24B Q4_K_M (RTX 4090) | 60–100 | ~40 | Yes |
| DeepSeek Coder (API, EU latency) | 150–400 | 80–120 | Marginal |
| Qwen 3.6 27B Q8_0 (dual RTX 3090) | 100–150 | ~25 | Yes (quality tradeoff) |
DeepSeek API latency from the EU (Frankfurt) to DeepSeek's servers varies by load; 400 ms to first token is common during peak hours. For autocomplete workflows, local inference is reliably faster.
Warning: Ollama's default num_ctx of 2048 increases apparent throughput (fewer tokens to process) but truncates context. Set num_ctx to 32768 for accurate coding latency measurements.
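To reproduce the sustained-throughput column on your own machine, Ollama's native /api/generate endpoint returns token counts and timings with each response. The snippet below is a minimal sketch against a local Ollama instance; the model tag qwen3-coder:27b follows the Modelfile in the next section, and the prompt is a placeholder.

# Rough throughput check against a local Ollama server
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3-coder:27b",
        "prompt": "Write a Python function that parses ISO 8601 timestamps.",
        "stream": False,
        "options": {"num_ctx": 32768},  # avoid the 2048-token default noted above
    },
    timeout=600,
)
data = resp.json()

# Ollama reports durations in nanoseconds
prompt_ms = data["prompt_eval_duration"] / 1e6
gen_tok_per_s = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"prompt processing: {prompt_ms:.0f} ms (rough proxy for time to first token)")
print(f"sustained generation: {gen_tok_per_s:.1f} tok/sec")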
Hardware Requirements
- Qwen 3.6 27B Q4_K_M: 16 GB VRAM. Fits on RTX 4080 (16 GB), RTX 3090 (24 GB), RTX 4090 (24 GB), Apple M3/M4/M5 Max 48 GB.
- Mistral Devstral Small 24B Q4_K_M: 14 GB VRAM. Fits on RTX 4070 Ti Super (16 GB), RTX 3090 (24 GB), Apple M3/M4/M5 Pro 36 GB.
- Codestral 22B Q4_K_M: 13 GB VRAM. RTX 4070 Ti (12 GB) is marginal; 16 GB recommended.
- Running two models simultaneously: a single RTX 4090 (24 GB) cannot hold Qwen 3.6 27B Q4_K_M and Devstral 24B Q4_K_M together; a dual-GPU setup (2 × 24 GB = 48 GB) can. Apple M5 Max (128 GB unified, 460–614 GB/s bandwidth) comfortably runs both models simultaneously via MLX. (A rough VRAM estimator is sketched after this list.)
- Apple Silicon recommendation: M5 Pro (64 GB unified memory) runs Qwen 3.6 27B at ~48 tokens/sec via MLX. M5 Max (128 GB) achieves ~55 tokens/sec for Qwen and can run both Qwen and Devstral simultaneously, making it the quietest and most power-efficient option. An M4 Max with 48 GB is also suitable at ~42 tokens/sec (see the latency table above).
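As a rough sanity check on these VRAM figures, you can estimate a GGUF quant's weight footprint from the parameter count and the quantisation's average bits per weight; Q4_K_M averages roughly 4.8 bits per weight. The sketch below covers weights only, the 4.8 figure is an approximation rather than an official number, and the KV cache for a 32k context adds a few GB on top.

# Back-of-the-envelope GGUF weight footprint (weights only, KV cache excluded)
def gguf_weight_gb(params_billion: float, bits_per_weight: float = 4.8) -> float:
    # params_billion * bits_per_weight / 8 gives gigabytes directly
    # (the 1e9 weight count cancels against the GB divisor)
    return params_billion * bits_per_weight / 8

print(f"Qwen 3.6 27B Q4_K_M weights: ~{gguf_weight_gb(27):.1f} GB")   # ~16 GB
print(f"Devstral 24B Q4_K_M weights: ~{gguf_weight_gb(24):.1f} GB")   # ~14 GB

The Ollama configuration below then pins the context window and sampling parameters for the Qwen model: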
# Ollama config for Qwen 3.6 27B with num_ctx and GPU layers
cat > Modelfile-qwen3-coder <<'EOF'
FROM qwen3-coder:27b
PARAMETER num_ctx 32768
# num_gpu is the number of layers to offload to the GPU; 99 offloads all of them
PARAMETER num_gpu 99
PARAMETER num_thread 8
PARAMETER temperature 0.2
SYSTEM "You are an expert software engineer. Respond with clean, well-structured code."
EOF
ollama create qwen3-coder-local -f Modelfile-qwen3-coder
ollama run qwen3-coder-local

Multi-Model Dispatch Strategy
No single coding model wins every task. Qwen 3.6 27B leads on benchmark accuracy. Devstral leads on agentic multi-file tasks. DeepSeek Coder is the cheapest at scale for non-sensitive code. A dispatch layer that routes tasks by type captures the benefits of all three.
A suggested dispatch matrix for a development team:
| Task Type | Recommended Model | Why |
|---|---|---|
| Private/GDPR code (client data) | Qwen 3.6 27B (local) | GDPR compliance by design |
| Autocomplete (interactive) | Devstral 24B (local) | Fastest sustained output, 40 tok/sec |
| Code review (non-sensitive) | DeepSeek Coder (API) | $0.14/1M, good quality, high throughput |
| Complex refactoring (multi-file) | Qwen 3.6 27B (local) + PromptQuorum consensus | Best SWE-bench, GDPR-safe |
| Batch test generation | DeepSeek Coder (API) | Cost-optimised for non-sensitive volume |
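As a concrete sketch of what such a dispatch layer can look like, the plain-Python router below mirrors the matrix above. The model identifiers and the keyword-based sensitivity check are illustrative assumptions rather than part of any specific tool; the PromptQuorum configuration in the next section expresses the same rules declaratively.

# Illustrative task router implementing the dispatch matrix above.
# Model names and the keyword heuristic are assumptions for this sketch.
SENSITIVE_MARKERS = ("customer", "personal", "gdpr", "private")

def route(task_type: str, content: str, token_count: int) -> str:
    text = content.lower()
    if any(marker in text for marker in SENSITIVE_MARKERS):
        return "qwen3-coder-local"   # GDPR-sensitive work stays local
    if task_type == "autocomplete":
        return "devstral"            # fastest sustained local output
    if task_type in ("code_review", "test_generation") or token_count > 50_000:
        return "deepseek-chat"       # non-sensitive bulk work goes to the cheap API
    return "qwen3-coder-local"       # default: best local SWE-bench score

print(route("code_review", "review the public utils module", token_count=80_000))   # deepseek-chat
print(route("refactor", "migrate customer billing schema", token_count=12_000))     # qwen3-coder-local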
PromptQuorum Integration
PromptQuorum routes code tasks across local Qwen, local Devstral, and cloud APIs based on task classification rules you define. This eliminates manual model switching and implements the dispatch matrix above automatically.
In One Sentence
PromptQuorum routes coding tasks to local Qwen 3.6 for GDPR-sensitive code and DeepSeek Coder for non-sensitive bulk generation.
# PromptQuorum routing config for coding workloads
# Set in your PromptQuorum settings or .env file
# Local models (via Ollama)
LOCAL_OLLAMA_URL=http://localhost:11434/v1
LOCAL_CODING_MODEL=qwen3-coder-local # Qwen 3.6 27B with num_ctx 32768
LOCAL_AUTOCOMPLETE_MODEL=devstral # Mistral Devstral 24B
# Cloud fallback
DEEPSEEK_API_KEY=your_key_here
DEEPSEEK_MODEL=deepseek-chat
# Routing rules (PromptQuorum dispatch)
# route: task_contains("private") OR task_contains("customer") -> qwen3-coder-local (local)
# route: task_type == "autocomplete" -> devstral (local)
# route: token_count > 50000 -> deepseek-chat (cloud, non-sensitive only)
# default -> qwen3-coder-local (local)

FAQ
Is Qwen 3.6 27B better than DeepSeek Coder for local coding?
For local deployment, Qwen 3.6 27B achieves 77.2% on SWE-bench Verified and runs fully locally on 16 GB VRAM, making it GDPR-compliant for EU teams. DeepSeek Coder is a cloud API at ~$0.14/1M input tokens and the better choice for non-sensitive, high-volume code generation where local hardware is not available. The trade-off comes down to data sensitivity and budget; there is no single winner.
What is Mistral Devstral and why is it mentioned here?
Mistral Devstral Small 24B is a coding-focused model from Mistral AI, released May 2026, designed specifically for agentic coding tasks: multi-file refactoring, tool use, and iterative code generation. It scores 90.1% on HumanEval and runs in 14 GB of VRAM. It is particularly strong at tasks that require multiple sequential code operations, where its agentic training gives it an edge that Qwen 3.6 27B's higher benchmark scores do not capture.
Can I run Qwen 3.6 27B and Devstral 24B simultaneously?
On a single RTX 4090 (24 GB VRAM), no: Qwen 3.6 27B Q4_K_M uses ~15.8 GB and Devstral 24B Q4_K_M uses ~14.2 GB, totalling ~30 GB. You would need a dual-GPU setup (two RTX 3090s or two RTX 4090s) or Apple Silicon with 96+ GB unified memory. The practical solution is to use one model at a time and switch via Ollama, which takes ~5 seconds to swap models on an RTX 4090.
Is DeepSeek Coder safe to use for EU company code?
DeepSeek Coder processes data on DeepSeek's servers, which are operated by DeepSeek AI, a company incorporated in China. The EU Commission has not issued an adequacy decision for China. Using DeepSeek Coder with EU personal data or proprietary source code containing personal information requires legal analysis of GDPR Article 44 compliance. For proprietary code without personal data, consult your legal team. For personal data processing, local Qwen 3.6 27B is the compliant alternative.
What is SWE-bench and why focus on it?
SWE-bench (Software Engineering benchmark) tests whether an LLM can resolve real GitHub issues in open-source codebases like Django, Flask, and NumPy. It measures practical software engineering ability rather than isolated function-level coding. Qwen 3.6 27B achieves 77.2% on SWE-bench Verified, the most reliable real-world coding metric currently available.