Key Takeaways
- VRAM math: (Model size in GB) Γ· Quantization = VRAM needed. Example: 70B at Q4 = 70 Γ· 8 = 8.75 GB Γ parameters β 39 GB total.
- 12 GB VRAM (RTX 4070 Ti): Best model: Llama 4 Scout 17B Q4_K_M (~10 GB, MoE, best overall quality). Also: Llama 3.1 8B Q8 (~9 GB, 80 tok/sec).
- 16 GB VRAM (RTX 4080 / RTX 5080): Best model: Mistral Small 3.1 24B Q4_K_M (~13 GB, 55 tok/sec). Also: Devstral Small 24B Q4_K_M for agentic coding.
- 24 GB VRAM (RTX 4090): Most 70B models at Q4_K_M (39 GB) do NOT fit. Best option: DeepSeek-R1 32B Q4_K_M (~19 GB, 60 tok/sec) or Qwen 3.6 27B (~16 GB, 77.2% SWE-bench).
- CPU-only (16 GB system RAM): Llama 3.2 3B Q8 (20 tok/sec) or Phi-4 Mini Q4_K_M (25 tok/sec). A used RTX 4060 8GB (~$150) or RTX 5060 Ti 12GB (~$250) is 5-10Γ faster.
- Apple M5 Max (128 GB unified): First Mac to run 70B models at Q4_K_M β comparable to dual RTX 4090 desktops in a laptop or Mac Studio.
- llama.cpp speed tip: Always set `--n-gpu-layers 99`. This alone doubles speed on RTX 4070 Ti from ~40 to ~85 tok/sec.
- Quick reference: 7B@Q4_K_M = 4.7 GB | 70B@Q4_K_M = 40 GB | RTX 4070 Ti = ~80 tok/s | RTX 4090 = ~150 tok/s | CPU-only 16 GB = 12-28 tok/s
Best GPUs to Buy β 2026 Recommendations
Choosing a GPU depends on your budget and the model size you want to run. The NVIDIA RTX 40-series (4060, 4070 Ti, 4090) and RTX 50-series (5060 Ti, 5080) dominate for local LLMs in 2026. Here are the top recommendations by use case:
- For 7B Models (Mistral, Phi-4, Llama 3.2) β Budget: RTX 4060 (8 GB VRAM, ~$180β220). Runs any 7B model at Q4_K_M. Speed: 40β60 tok/sec. Tier: Budget enthusiasts.
- For 14B Models (Llama 3.1, DeepSeek-R1) β Mainstream: RTX 4070 Ti (12 GB VRAM, ~$500β600). Best price-to-performance. Llama 4 Scout 17B Q4 runs well. Speed: 85β120 tok/sec. Tier: Most popular.
- For 33B Models (Qwen2.5, Mistral Small) β Mid-Range: RTX 4080 or RTX 5080 (16 GB VRAM, ~$1000β1200). Runs Devstral Small 24B Q4_K_M. Speed: 110β140 tok/sec. Tier: Professional developers.
- For 70B Models (Llama 3.3, Qwen 3.6) β High-End: RTX 4090 (24 GB VRAM, ~$1700β2000). Runs 70B at Q3_K_M (~25 GB). For Q4_K_M (40 GB), use dual RTX 4090. Speed: 150β180 tok/sec single GPU. Tier: Research + production.
- Best Value 2026: RTX 4070 Ti + RTX 5060 Ti 12GB combo (~$750 total) β runs 70B at Q3 and 14B at Q4 simultaneously.
- For Apple Users: Mac M5 Max (128 GB unified memory) first Mac running true 70B models. ~$6000. Equivalent performance to dual RTX 4090 setup.
| GPU | Best For | Price | Speed | Tier |
|---|---|---|---|---|
| RTX 4060 (8 GB) | 7B models | ~$180β220 | 40β60 tok/s | Budget |
| RTX 4070 Ti (12 GB) | 14B models | ~$500β600 | 85β120 tok/s | Mainstream |
| RTX 4080 / RTX 5080 (16 GB) | 33B models | ~$1000β1200 | 110β140 tok/s | Professional |
| RTX 4090 (24 GB) | 70B (Q3) | ~$1700β2000 | 150β180 tok/s | High-end |
| Dual RTX 4090 | 70B (Q4) | ~$3400β4000 | 280β360 tok/s | Enterprise |
| Mac M5 Max 128GB | 70B (Q4) | ~$6000 | 120β160 tok/s | Pro laptop |
How Do You Calculate VRAM Requirements?
VRAM requirements depend on three factors: model size (parameters), quantization (bits per weight), and inference mode. Use this formula to determine if your GPU has enough memory. For an interactive calculator, see the VRAM calculator for local LLMs.
Formula:
```text VRAM (GB) = (Model Size Γ Quantization Bits) Γ· 8 ```
Quantization values: FP16 = 16 bits, Q8_0 = 8 bits, Q5_K_M = 5 bits, Q4_K_M = 4 bits. The practical sweet spot is Q4_K_M -- it uses 4-bit weights with K-quantization, which NVIDIA GPUs accelerate more efficiently than the older Q4_0 format.
| Model | FP16 | Q8_0 | Q5_K_M | Q4_K_M |
|---|---|---|---|---|
| Llama 4 Scout 17B (active) | ~34 GB | ~18 GB | ~12 GB | ~10 GB |
| Llama 3.1 8B | 16 GB | 8.5 GB | 5.7 GB | 4.7 GB |
| Qwen 3.6 27B | ~54 GB | ~28 GB | ~19 GB | ~16 GB |
| Qwen3 8B | ~16 GB | ~8.5 GB | ~5.7 GB | ~5 GB |
| Llama 3.3 70B | 140 GB | 70 GB | 48 GB | 40 GB |
| Qwen2.5 32B | 64 GB | 33 GB | 22 GB | 19 GB |
| Mistral Small 3.1 24B | 48 GB | 25 GB | 17 GB | 14 GB |
| Phi-4 Mini 3.8B | 7.6 GB | 4.1 GB | 2.7 GB | 2.3 GB |
Q4_K_M is the recommended default for consumer hardware -- 90-95% of FP16 quality at 25-30% of the VRAM cost. Llama 4 Scout uses MoE architecture with 17B active parameters out of 109B total. VRAM is determined by active params for inference, not total params.
β’KeyPoint: In one sentence: VRAM is the GPU's dedicated memory pool -- the single number that determines which AI models you can run locally and at what quality.
KV Cache: The Hidden VRAM Cost
The VRAM formula (Model Size Γ Bits Γ· 8) covers model weights only -- KV cache adds significant additional VRAM that most guides ignore.
The KV cache stores attention state for every token in your context window. It grows linearly with context length and stays in VRAM throughout the session.
KV cache VRAM formula: `KV cache β layers Γ heads Γ head_dim Γ 2 Γ context_length Γ 2 bytes`
| Model | 4K context | 32K context | 128K context |
|---|---|---|---|
| Llama 3.1 8B | 0.5 GB | 4 GB | 16 GB |
| Llama 3.3 70B | 2 GB | 16 GB | 64 GB |
| Qwen2.5 32B | 1 GB | 8 GB | 32 GB |
β’KeyPoint: In one sentence: KV cache is temporary VRAM used to store conversation context -- it grows with every token you generate and is separate from model weight storage.
β οΈWarning: A Llama 3.1 8B at Q4_K_M needs 4.7 GB for weights -- but add a 32K context window and total VRAM rises to ~8.7 GB. On an 8 GB card, this causes OOM errors.
β’KeyPoint: Rule of thumb: Add 25% to model weight size for typical 8K context, 100% for 32K context. Ollama default context is 2,048 tokens. To set higher: PARAMETER num_ctx 32768 in your Modelfile.
Which GPU Tier Matches Your Workload?
As of May 2026, NVIDIA GPUs deliver the highest tokens/sec for local LLM inference across all price points. The sections below each tier give specific model recommendations. For a detailed benchmark comparison, see the best GPUs for local LLM guide.
| Tier | GPU | VRAM | Best For | Speed |
|---|---|---|---|---|
| Budget ($600) | RTX 4070 Ti / RTX 5070 | 12 GB | 7-13B models | ~80 tok/s |
| Mid ($900) | RTX 5070 Ti | 16 GB | 13-30B models | ~100 tok/s |
| High ($1,200) | RTX 4080 / RTX 5080 | 16 GB | 13-30B models | ~120 tok/s |
| Top ($1,800) | RTX 4090 | 24 GB | 32B models, 70B at Q2_K | ~150 tok/s |
| Latest ($2,000) | RTX 5090 | 32 GB | 70B + headroom | ~200 tok/s |
| Server ($3,000+) | RTX 6000 Ada / A100 | 48+ GB | Multi-user, 70B+ | Production |
| Desktop AI ($3,999) | NVIDIA DGX Spark | 128 GB | Any model, unified | 18-28 tok/s |
β’KeyPoint: As of May 2026, the RTX 50-series (Blackwell) is the current generation. RTX 5090 (32 GB) is future-proof for 70B models. RTX 4090 remains excellent value for existing buyers.
Best Local LLMs by VRAM Tier (May 2026)
Use this as a quick lookup by your GPU's VRAM tier:
All models listed below are open-weights β downloadable, fine-tunable, and free to run locally. If you're choosing between open-weights and proprietary APIs, see our open-source vs proprietary LLMs comparison for cost and performance trade-offs at different token volumes.
Hardware determines which models you can run; prompt engineering determines how well they perform. A well-structured prompt on a 7B model often outperforms a lazy prompt on a 70B model. See the complete prompt engineering guide for techniques that maximise output quality at any parameter count.
- 8 GB VRAM (RTX 4060, RTX 5060 Ti, Intel B580): Llama 3.1 8B Q4_K_M (4.7 GB, ~70 tok/s) -- recommended. Qwen3 8B (5 GB, best multilingual + coding). Phi-4 Mini 3.8B (2.3 GB, fastest). Gemma 2 9B (5.5 GB, fits with care). Avoid 13B+ models.
- 12 GB VRAM (RTX 4070 Ti, RTX 5070, Intel B770): Llama 4 Scout 17B Q4_K_M (~10 GB, best overall quality, MoE). Llama 3.1 8B (4.7 GB, fast with headroom). Qwen2.5 14B Q4_K_M (8.5 GB, better reasoning on budget). DeepSeek-R1 8B (5 GB, best reasoning). Avoid 30B+.
- 16 GB VRAM (RTX 4080, RTX 5070 Ti, RTX 5080): Mistral Small 3.1 24B Q4_K_M (14 GB, best quality at tier). Devstral Small 24B Q4_K_M (~16 GB) for agentic coding. Qwen2.5 14B (9 GB, fast with context headroom). Llama 3.3 70B at Q2_K (17 GB, possible but degraded quality).
- 24 GB VRAM (RTX 5090, RTX 4090, Tesla L40): Qwen 3.6 27B Q4_K_M (~16 GB, 77.2% SWE-bench, best dense coding model). DeepSeek-R1 32B Q4_K_M (~19 GB, best reasoning). Qwen2.5 32B Q5_K_M (~21 GB). Llama 3.3 70B needs 2Γ 24 GB GPUs at Q4_K_M.
- 32 GB VRAM (RTX 5090): Llama 3.3 70B Q4_K_M (40 GB -- needs minimal CPU offload for last layers). Kimi K2.6 quantized (MoE, 42B active, MIT license, best coding). Qwen2.5 32B (19 GB, fits entirely with 13 GB spare). RTX 5090 is the first single consumer GPU that fits 70B with minimal offload.
- 48+ GB VRAM (RTX 6000 Ada, A100, DGX Spark): Llama 3.3 70B Q4_K_M (40 GB, fits entirely). Llama 4 Maverick (17B active, 400B total, MoE). Llama 3.3 70B Q8_0 (70 GB -- needs 80 GB A100). NVIDIA DGX Spark (128 GB unified) fits every open-weight model including 70B at Q8_0 with 58 GB to spare.
Which Local LLMs Run Best on 16 GB VRAM?
On a 16 GB VRAM GPU (NVIDIA RTX 4080, RTX 5080, or RTX 4090 laptop), the practical ceiling is 14-24B models. Mistral Small 3.1 24B at Q4_K_M is the best overall choice: it uses 13 GB VRAM, runs at 55 tok/sec, and is EU-origin with Apache 2.0 license.
Devstral Small 24B Q4_K_M fits at ~16 GB and is optimized for agentic coding workflows. The table below shows which models fit and which do not. "Does NOT fit" rows are included intentionally -- this is the most common mistake 16 GB owners make.
| Model | Quantization | VRAM Used | Speed (RTX 4080) | Best For | Fits 16 GB? |
|---|---|---|---|---|---|
| Mistral Small 3.1 24B | Q4_K_M | ~13 GB | 55 tok/sec | General chat | β Yes |
| Devstral Small 24B | Q4_K_M | ~16 GB | 45 tok/sec | Agentic coding | β Tight |
| Qwen2.5 14B | Q8_0 | ~15 GB | 45 tok/sec | Coding + reasoning | β Yes |
| DeepSeek-R1 14B | Q8_0 | ~15 GB | 40 tok/sec | Math + analysis | β Yes |
| Llama 3.1 8B | FP16 | ~16 GB | 70 tok/sec | Fastest responses | β Tight |
| Llama 3.3 70B | Q4_K_M | ~39 GB | -- | -- | β No (needs 39 GB) |
β’ProTip: π Best overall for 16 GB: Mistral Small 3.1 24B Q4_K_M at ~13 GB, 55 tok/sec. For agentic coding, use Devstral Small 24B (Mistral AI, France) at 45 tok/sec. Best reasoning: DeepSeek-R1 14B Q8_0 at 40 tok/sec.
β οΈWarning: RTX 4090 laptop GPUs have 16 GB VRAM (not 24 GB). They share the same model ceiling as the RTX 4080 desktop.
β’KeyPoint: When to upgrade to 24 GB (RTX 4090 desktop): only if you need 32B+ models at Q8, or want to run two models simultaneously without reloading.
Which Local LLMs Run Best on 12 GB VRAM?
On a 12 GB VRAM GPU (NVIDIA RTX 4070 Ti, RTX 5070, or RTX 5060 Ti), you can run 7-8B models at Q8, 14B at Q4_K_M, or the new Llama 4 Scout 17B at Q4_K_M (MoE). Llama 4 Scout uses a Mixture-of-Experts architecture with 17B active parameters out of 109B total -- this makes Scout significantly more VRAM-efficient than its parameter count suggests.
Llama 3.1 8B at Q8_0 is the most reliable choice for conservative setups: 9 GB VRAM, 80 tok/sec, and full instruction-following quality. Qwen2.5 14B at Q4_K_M also fits at ~8.5 GB and delivers notably better reasoning than the 8B tier.
| Model | Quantization | VRAM Used | Speed (RTX 4070 Ti) | Best For | Fits 12 GB? |
|---|---|---|---|---|---|
| Llama 4 Scout 17B | Q4_K_M | ~10 GB | ~65 tok/sec | Best overall (MoE) | β Yes |
| Llama 3.1 8B | Q8_0 | ~9 GB | 80 tok/sec | General chat + coding | β Yes |
| Qwen2.5 14B | Q4_K_M | ~8.5 GB | 65 tok/sec | Better reasoning on budget | β Yes |
| Llama 3.2 11B Vision | Q5_K_M | ~8 GB | 65 tok/sec | Image + text tasks | β Yes |
| Qwen3 8B | Q8_0 | ~8 GB | 85 tok/sec | Best multilingual + coding | β Yes |
| Mistral 7B v0.3 | FP16 | ~14 GB | -- | -- | β No (needs 14 GB at FP16) |
β’ProTip: π Best overall for 12 GB: Llama 4 Scout 17B Q4_K_M at ~10 GB. MoE architecture means 17B active params with 109B total β better quality than any dense 8B model at similar VRAM cost. If you prefer dense models, use Llama 3.1 8B Q8_0 at ~9 GB.
β’KeyPoint: RTX 3060 12GB is the budget entry point (~$200 used). It runs all 12 GB models but at ~60-70 tok/sec vs ~80-90 tok/sec on RTX 4070 Ti due to older memory architecture.
Which 70B Models Actually Fit in 24 GB VRAM (RTX 4090)?
The RTX 4090 has 24 GB VRAM -- not enough for most 70B models at acceptable quality. Llama 3.3 70B at Q4_K_M requires approximately 39 GB. The common misconception is that "Q4 is small" -- at 70B parameters, even Q4 is large.
On a single RTX 4090, the best strategy is 27-32B models, which deliver strong quality and fit comfortably. Qwen 3.6 27B at Q4_K_M is the best dense coding model (77.2% SWE-bench). For true 70B at Q4+, you need 2Γ RTX 4090 or a 48 GB server GPU. See how to run 70B models on 24 GB VRAM for advanced techniques.
| Model | Quantization | VRAM Required | Fits 24 GB? | Speed (RTX 4090) | Notes |
|---|---|---|---|---|---|
| Qwen 3.6 27B | Q4_K_M | ~16 GB | β Yes | 55 tok/sec | Best dense coding model, 77.2% SWE-bench |
| DeepSeek-R1 32B | Q4_K_M | ~19 GB | β Yes | 60 tok/sec | Best reasoning, strong overall quality |
| Qwen2.5 32B | Q5_K_M | ~21 GB | β Yes | 55 tok/sec | High quality, excellent coding + instruction |
| Qwen2.5 32B | Q8_0 | ~34 GB | β No | -- | Requires 48 GB GPU |
| Llama 3.3 70B | Q2_K | ~24 GB | β οΈ Barely | 30 tok/sec | Fits but Q2 quality is noticeably degraded |
| Llama 3.3 70B | Q4_K_M | ~39 GB | β No | -- | Needs 2Γ RTX 4090 or A100 80 GB |
β’KeyPoint: π Best for RTX 4090 (24 GB): Qwen 3.6 27B Q4_K_M (~16 GB, 77.2% SWE-bench) for best dense coding model. For reasoning: DeepSeek-R1 32B Q4_K_M (~19 GB, 60 tok/sec). Better than Llama 3.3 70B Q2_K at far less VRAM.
β οΈWarning: If you specifically need 70B quality at Q4+, the RTX 4090 is not the right GPU. You need 2Γ RTX 4090 (48 GB combined via tensor parallelism) or an RTX 6000 Ada (48 GB). Running 70B at Q2_K on a single 4090 noticeably hurts output quality.
What CPU and RAM Do You Need?
With a dedicated GPU, CPU and RAM are secondary components. The GPU handles matrix math; CPU/RAM manage context preparation. For a full comparison of GPU vs CPU vs Apple Silicon inference speeds, see the GPU vs CPU vs Apple Silicon guide.
Minimum CPU: 8-core processor (Intel Core i7 14th gen, AMD Ryzen 7 7700X, or newer). Older CPUs add 20%+ latency.
RAM: 16 GB minimum (with GPU). If running without GPU, 32+ GB recommended. RAM does not directly limit model size when GPU is present.
Storage: 500 GB SSD for model files and OS. M.2 NVMe is preferred (faster model loading).
Which Models Run Well on 16 GB System RAM Without a GPU?
Without a GPU, a machine with 16 GB system RAM can run 3B-7B models at 8-20 tokens/sec using CPU inference. The bottleneck is memory bandwidth, not RAM capacity -- CPUs have far lower bandwidth than GPUs, which is why inference is 5-10Γ slower.
On 16 GB system RAM, the practical rule is: model file size + 4 GB OS overhead β€ 16 GB. A 7B model at Q4_K_M (4.9 GB) fits, but leaves little headroom for long contexts. The table below shows realistic options as of May 2026.
For a complete speed-optimized model guide covering CPU-only, 4 GB, 6 GB, and 8 GB VRAM tiers with real benchmarks, see **Fastest Local LLMs for Low-End PCs**.
| Model | Quantization | RAM Used | Speed (Ryzen 9 7950X) | Best For | Notes |
|---|---|---|---|---|---|
| Gemma 2 2B | Q8_0 | ~2.7 GB | 28 tok/sec | Fastest, minimal RAM | Leaves 13 GB free for OS |
| Phi-4 Mini 3.8B | Q4_K_M | ~2.5 GB | 25 tok/sec | Coding on CPU | Best quality-per-RAM ratio |
| Llama 3.2 3B | Q8_0 | ~3.8 GB | 20 tok/sec | General chat, low RAM | Reliable, widely supported |
| Llama 3.1 8B | Q4_K_M | ~4.9 GB | 12 tok/sec | Best CPU quality | 12 tok/sec is slow but usable for batch tasks |
| Llama 3.1 8B | Q8_0 | ~9 GB | 8 tok/sec | Max quality on CPU | Too slow for interactive use on most CPUs |
β’ProTip: π Best for 16 GB RAM, no GPU: Phi-4 Mini 3.8B Q4_K_M (2.5 GB, 25 tok/sec). Delivers surprisingly strong coding and reasoning for its size.
β’KeyPoint: CPU vs GPU speed reality: A used NVIDIA RTX 3060 12 GB (~$200) runs Llama 3.1 8B at 70+ tok/sec -- 5-8Γ faster than the Ryzen 9 7950X at CPU-only inference. If speed matters, buy a GPU before adding RAM.
β οΈWarning: Running a 7B model on 16 GB RAM with CPU-only leaves fewer than 7 GB for the OS and browser. With long conversation contexts (32k+ tokens), the model file grows beyond its base size and can cause RAM exhaustion. Keep context size under 4096 on 16 GB CPU-only machines.
How Much Storage Do You Need?
Model files are large: a 7B model at 4-bit quantization is 4-5 GB. Plan storage around the number and size of models you want to keep locally.
- 500 GB SSD: OS + 1-2 small models (3B, 7B)
- 1 TB SSD: OS + 3-5 models (mix of 7B and 13B)
- 2 TB SSD: OS + 10+ models (various sizes)
- 4 TB NVMe RAID: Production setup, fast model loading
What Hardware Build Should You Buy?
Building a local LLM machine from scratch means prioritizing GPU first, then CPU and RAM. Here are three realistic configurations. For multi-GPU builds, see the multi-GPU local LLM guide.
| Budget | GPU | CPU | RAM | Models | Cost |
|---|---|---|---|---|---|
| $1500 (entry) | RTX 4070 Ti | i7 13700 | 16 GB | 7-13B | Realistic |
| $2500 (solid) | RTX 4080 | i7 14700K | 32 GB | 13-30B | Recommended |
| $4000 (high-end) | 2Γ RTX 4090 | Ryzen 9 7950X | 128 GB | Any (70B+) | Overkill for personal |
What If You Can't Afford the Hardware?
If a $250β400 GPU is outside your budget, or your laptop is too old to support modern inference engines, local LLMs may not be cost-effective for you in 2026.
Calculate the real cost:
- Local: $800β2,000 upfront hardware + electricity + maintenance over 2β3 years
- Cloud: $5β50/month for typical developer use (Llama API or GPT-4o mini)
For light users (< 100,000 tokens/month), cloud APIs cost $5β10/month and require zero hardware. For heavy users (> 10M tokens/month), local breaks even in 6β12 months.
Compare full local vs cloud cost and performance trade-offs** to find your break-even point. Many developers discover cloud is cheaper for their actual usage pattern.
Already shopping below the recommended VRAM tiers? See Best Local AI App for a Low-End PC for which model and app combinations actually run on 8 GB or less.
How Do You Maximize llama.cpp Speed on RTX 4070 Ti?
With correct settings, llama.cpp on an RTX 4070 Ti achieves 85-95 tokens/sec on Llama 3.1 8B Q4_K_M -- more than double the default out-of-box speed. The single most impactful flag is `--n-gpu-layers 99`, which offloads all model layers to the GPU. Without it, layers fall back to CPU, creating a severe bottleneck.
These settings apply to llama.cpp directly and to Ollama (which uses llama.cpp internally). Ollama sets `--n-gpu-layers 99` automatically on NVIDIA hardware if drivers are installed correctly.
- Q4_K_M beats Q4_0 by 15-20% on RTX 4070 Ti. The K_M variant uses mixed quantization that NVIDIA tensor cores accelerate more efficiently. Always choose Q4_K_M over Q4_0 when both are available.
- IQ4_XS is the smallest format (~8% smaller than Q4_K_M) with minimal quality loss. Useful for fitting Qwen2.5 14B into 12 GB VRAM when Q4_K_M is borderline.
- Q5_K_M runs at nearly the same speed as Q4_K_M on NVIDIA GPUs (< 5% slower) while providing noticeably better output quality. Worth using when you have 20% VRAM headroom.
| Flag | What It Does | Impact | Default | Notes |
|---|---|---|---|---|
| --n-gpu-layers 99 | Offloads all layers to GPU | +100-150% speed | 0 (CPU only) | Most important flag -- always set this first |
| --threads [cores] | CPU threads for prompt processing | +10-15% speed | All threads (including HT) | Set to physical core count only. Hyperthreading hurts inference. |
| --ctx-size 2048 | KV cache / context window size | Saves 0.5-8 GB VRAM | 4096 | 2048 = ~0.5 GB extra VRAM. 32768 = ~8 GB extra. Only increase if needed. |
| --n-batch 512 | Prompt processing batch size | +5-10% throughput | 512 | Good default. Increase to 1024 for batch workloads if VRAM allows. |
| --flash-attn | Flash Attention 2 kernel | -20-30% VRAM at long ctx | Disabled | Available since llama.cpp b2900. Reduces VRAM for contexts > 8k tokens. |
β’ProTip: Run `ollama ps` to confirm your model is loaded on GPU. If GPU utilization shows 0% in `nvidia-smi` while generating, drivers are not correctly routing to CUDA. Reinstall NVIDIA CUDA Toolkit and restart Ollama.
β’KeyPoint: RTX 4070 Ti speed reference: Llama 3.1 8B Q4_K_M = 85-95 tok/sec. Llama 3.1 13B Q4_K_M = 60-70 tok/sec. Qwen2.5 7B Q8_0 = 90-95 tok/sec. These assume --n-gpu-layers 99 and --ctx-size 2048.
β οΈWarning: Increasing --ctx-size beyond 8192 on a 12 GB GPU will cause model layer offloading back to CPU if the KV cache exhausts remaining VRAM. If speed drops suddenly on long conversations, reduce context size or use --flash-attn.
Can Mac Hardware Run Local LLMs?
Apple Silicon (M-series) runs local LLMs efficiently using unified memory shared between CPU and GPU. M5 introduced since October 2025 offers a significant upgrade for local inference. Apple claims 4Γ faster LLM prompt processing vs M4.
The M5 Max with 128 GB unified memory is the first Apple Silicon chip that comfortably runs 70B models at Q4_K_M -- comparable to dual RTX 4090 desktops but in a laptop or Mac Studio form factor. The M5 Pro with 64 GB unified memory handles 32B models with generous headroom for KV cache and multitasking.
| Mac | GPU Memory | Best For | Limitation |
|---|---|---|---|
| M3 MacBook Pro 16" | 18 GB unified | 7B models (fast) | Can run 13B slowly |
| M4 Max | 48-96 GB unified | 13-30B models | Not optimized for 70B |
| M5 Pro (MacBook Pro) | 64 GB unified, 307 GB/s | 30B models comfortably | Llama 4 Scout runs well |
| M5 Max (MacBook Pro / Studio) | 128 GB unified, 460-614 GB/s | 70B models at Q4_K_M | First Mac to fit 70B properly |
When Should You Use Server vs Consumer Hardware?
For production deployment (24/7 operation, multiple users), server-grade hardware is recommended over consumer GPUs. Consumer hardware is optimized for gaming, not sustained inference.
- Consumer (RTX 4090): ~$1800, 24 GB VRAM, single-user, prone to thermal throttling under sustained load.
- Server (RTX 6000 Ada): ~$5000, 48 GB VRAM, designed for 24/7 use, better cooling, error correction.
- Recommendation: Start with RTX 4090. If running 70B models 24/7 for multiple users, upgrade to dual A100 or RTX 6000.
NVIDIA DGX Spark: 128 GB Desktop AI Computer
The NVIDIA DGX Spark ($3,999) is the only consumer desktop as of May 2026 that fits Llama 3.3 70B at Q8_0 entirely in unified memory.
Built on the GB10 Grace Blackwell Superchip, the DGX Spark launched in late 2025 as a compact desktop AI computer with 128 GB LPDDR5x unified memory. As of May 2026, the DGX Spark also runs Llama 4 Scout and Maverick entirely in memory, as well as Kimi K2.6 (quantized), making it suitable for multi-GPU setups at this tier.
| Spec | Value |
|---|---|
| Unified memory | 128 GB LPDDR5x |
| Llama 3.3 70B at Q4_K_M | β fits (40 GB) |
| Llama 3.3 70B at Q8_0 | β fits (70 GB) |
| Inference speed (70B) | 18-28 tok/s |
| Price | $3,999 |
| OS | DGX OS (Ubuntu), Ollama pre-installed |
| vs RTX 4090 | 5Γ more VRAM, but 5Γ the price |
β’KeyPoint: Compared to 2Γ RTX 4090 (48 GB total, ~$3,600): DGX Spark has 2.7Γ more memory and faster unified bandwidth at a $400 premium. The RTX 4090 pair is better value unless you specifically need 70B at Q8_0 quality.
What Are the Most Common Hardware Mistakes?
- Buying CPU-only when GPU is available. A $600 RTX 4070 Ti will outperform a $2000 CPU. GPU dominates LLM speed.
- Not accounting for VRAM overhead. Model file size + system overhead + context = total VRAM used. Always buy 25% more than the model size.
- Assuming all 70B models fit in 40GB VRAM. They do, barely, in Q4 (4-bit) quantization only. Q5 requires 45+ GB.
- Ignoring power supply and cooling. RTX 4090 draws 575W. Need a 1200W PSU and good case airflow.
- Thinking an old GPU will work. RTX 2080 is 10Γ slower than RTX 4070 Ti. Modern GPU architecture significantly outperforms prior generations.
- Not accounting for KV cache VRAM on top of model weights: A 7B model at Q4_K_M is 4.7 GB of weights -- but with a 32K context window, the KV cache adds ~4 GB more, totalling ~8.7 GB. On an 8 GB card this causes OOM errors. Always add 25-100% to model size depending on context length.
- Treating hardware cost as the only cost: If you cannot afford 16+ GB RAM or a dedicated GPU, cloud APIs cost less for low-volume use ($0.01β0.05 per 1K tokens). See Local LLM vs Cloud: Cost Analysis for the full trade-off.
What Regional Compliance Rules Apply to Local LLM Hardware?
EU (GDPR + EU AI Act): Running LLMs locally keeps all inference data within your infrastructure, eliminating cross-border data transfer concerns under GDPR Article 44. As of May 2026, EU enterprises deploying LLMs for customer data processing must ensure models never phone home -- local hardware removes this risk entirely. EU AI Act high-risk system obligations apply from August 2, 2026 (pending Digital Omnibus which may delay to December 2027). Local hardware satisfies data residency requirements by default.
Japan (APPI): Japan's Act on the Protection of Personal Information (APPI) revision (2022) requires data minimization for AI processing. On-premises LLM hardware on a RTX 4090 workstation satisfies this requirement for document processing and customer support automation.
China: China's Cyberspace Administration of China (CAC) Generative AI Regulations (2023) require domestically-deployed AI models to undergo registration. Running local hardware with open-weight models avoids API-based compliance exposure for internal enterprise use.
Common Questions About Local LLM Hardware
Can I run a 70B model on a laptop?
Only with heavy quantization (Q2, 2-bit) and CPU fallback. Impractical. Laptops are suited for 7B models. For 70B, use a desktop with RTX 4090+.
Is RTX 4090 overkill for personal use?
Not if you run 70B models or multiple models simultaneously. For just 7B chat, RTX 4070 Ti suffices. RTX 4090 is future-proof if you want flexibility.
Should I buy RTX 5090 or wait for RTX 6090?
RTX 5090 is available (early 2026). RTX 6000 Ada server GPUs are also solid. Unless you have unlimited budget, RTX 5090 or 4090 are excellent.
How does quantization affect quality?
FP16 = 100% quality (baseline), Q8 = 99%, Q5 = 95%, Q4 = 90-95%. For most tasks, Q4 is indistinguishable from FP16.
Can I upgrade GPU later?
Yes. Start with RTX 4070 Ti now, upgrade to RTX 5090 in 2 years if needed. GPU is the most replaceable component.
How much RAM do I need to run a 7B model locally?
8 GB RAM is the absolute minimum for a 7B model. 16 GB is recommended for comfortable use alongside browser and OS. 32 GB gives headroom for larger context windows and multitasking.
Can I run local LLMs on Apple Silicon (M1/M2/M3/M4/M5)?
Yes. Apple Silicon uses unified memory shared between CPU and GPU. M5 Pro (64 GB, 307 GB/s) runs 30B models well. M5 Max (128 GB, 460-614 GB/s) is the first Mac to run 70B at Q4_K_M β comparable to dual RTX 4090 desktops.
What CPU is best for local LLMs without a GPU?
High-core-count CPUs with large L3 cache: AMD Ryzen 9 7950X or Intel Core i9-14900K. Expect 5-15 tokens/sec for 7B models. CPU inference is 3-5Γ slower than GPU.
Does storage speed affect local LLM performance?
Yes, at model load time. NVMe SSD (3-7 GB/s) loads a 7B model in 2-5 seconds vs. 20-60 seconds on HDD. Inference speed after loading is unaffected by storage.
Can I use multiple GPUs to run larger models?
Yes, via tensor parallelism. Two RTX 4090s (24 GB each) provide 48 GB VRAM for 70B models at FP16. Ollama and llama.cpp support multi-GPU via --n-gpu-layers split across cards.
What are the best local LLMs for 16 GB VRAM in 2026?
Mistral Small 3.1 24B Q4_K_M (13 GB, 55 tok/sec) is the best overall for RTX 4080 / RTX 5080 / RTX 4090 laptop. For agentic coding: Devstral Small 24B Q4_K_M (16 GB, 45 tok/sec). For reasoning: DeepSeek-R1 14B (15 GB, 40 tok/sec). Llama 3.3 70B does not fit -- it requires 39 GB at Q4_K_M.
Can a single RTX 4090 run a 70B model at good quality?
No -- not at Q4_K_M quality. Llama 3.3 70B at Q4_K_M requires ~39 GB VRAM. The RTX 4090 has 24 GB. You can run it at Q2_K (~24 GB) but quality drops noticeably. Better options: Qwen 3.6 27B Q4_K_M (~16 GB, 77.2% SWE-bench, best dense coding) or DeepSeek-R1 32B Q4_K_M (~19 GB, best reasoning).
What is the best local LLM for 16 GB system RAM without a GPU?
Phi-4 Mini 3.8B Q4_K_M (2.5 GB RAM, ~25 tok/sec on Ryzen 9 7950X) is the best option for CPU-only inference on 16 GB system RAM. Gemma 2 2B Q8 is the fastest at ~28 tok/sec. Llama 3.1 8B Q4_K_M (4.9 GB) also fits but runs at ~12 tok/sec -- slow for interactive use.
Sources
- NVIDIA. (2026). "GeForce GPU Specifications." https://www.nvidia.com/en-us/geforce/graphics-cards/ -- Official VRAM and bandwidth specs for RTX 40-series and RTX 50-series GPUs.
- Apple. (2026). "Apple M5 Chip." https://www.apple.com/mac/ -- M5 Pro/Max specifications, memory bandwidth, LLM performance claims. M5 is the first Mac that comfortably runs 70B models at Q4_K_M.
- NVIDIA. (2025). "DGX Spark Product Page." https://www.nvidia.com/en-us/products/workstations/dgx-spark/ -- Official specs for GB10 Grace Blackwell Superchip and 128 GB unified memory.
- Meta AI. (2024). "Llama 3.3 Model Card." https://llama.meta.com/ -- Official Llama 3.3 70B specifications and VRAM requirements.
- Meta AI. (2025). "Llama 4 Model Card." https://llama.meta.com/ -- Llama 4 Scout/Maverick MoE architecture, VRAM requirements.