Local LLM Hardware in 2026: GPU vs Mini PC vs Mac Compared

13 min read · By Hans Kuepper · Founder of PromptQuorum, multi-model AI dispatch tool

Local LLM hardware requirements depend primarily on VRAM: 7B models need 8 GB, 13B models need 12-16 GB, and 70B models need 35-48 GB depending on quantization. GPU choice matters 10× more than CPU for inference speed.

Running local LLMs requires matching your GPU's VRAM to the model you want to run. As of May 2026, a 7B model needs 8-9 GB VRAM at Q8, a 14B model needs 15 GB, and most 70B models need 39 GB at Q4_K_M -- more than a single RTX 4090 holds. This guide covers specific model recommendations for 12 GB, 16 GB, and 24 GB VRAM tiers, CPU-only inference on 16 GB system RAM, llama.cpp speed settings for RTX 4070 Ti, and full hardware build configurations.

Key Takeaways

  • VRAM math: (parameters in billions × bits per weight) ÷ 8 = GB of VRAM for weights. Example: 70B at a nominal 4 bits = 70 × 4 ÷ 8 = 35 GB; Q4_K_M averages closer to 4.5 bits per weight, so the real total is ≈ 39 GB.
  • 12 GB VRAM (RTX 4070 Ti): Best model: Llama 4 Scout 17B Q4_K_M (~10 GB, MoE, best overall quality). Also: Llama 3.1 8B Q8 (~9 GB, 80 tok/sec).
  • 16 GB VRAM (RTX 4080 / RTX 5080): Best model: Mistral Small 3.1 24B Q4_K_M (~13 GB, 55 tok/sec). Also: Devstral Small 24B Q4_K_M for agentic coding.
  • 24 GB VRAM (RTX 4090): Most 70B models at Q4_K_M (39 GB) do NOT fit. Best option: DeepSeek-R1 32B Q4_K_M (~19 GB, 60 tok/sec) or Qwen 3.6 27B (~16 GB, 77.2% SWE-bench).
  • CPU-only (16 GB system RAM): Llama 3.2 3B Q8 (20 tok/sec) or Phi-4 Mini Q4_K_M (25 tok/sec). A used RTX 4060 8GB (~$150) or RTX 5060 Ti 12GB (~$250) is 5-10× faster.
  • Apple M5 Max (128 GB unified): First Mac to run 70B models at Q4_K_M -- comparable to dual RTX 4090 desktops in a laptop or Mac Studio.
  • llama.cpp speed tip: Always set `--n-gpu-layers 99`. This alone doubles speed on RTX 4070 Ti from ~40 to ~85 tok/sec.
  • Quick reference: 7B@Q4_K_M = 4.7 GB | 70B@Q4_K_M = 40 GB | RTX 4070 Ti = ~80 tok/s | RTX 4090 = ~150 tok/s | CPU-only 16 GB = 12-28 tok/s

Best GPUs to Buy: 2026 Recommendations

Choosing a GPU depends on your budget and the model size you want to run. The NVIDIA RTX 40-series (4060, 4070 Ti, 4090) and RTX 50-series (5060 Ti, 5080) dominate for local LLMs in 2026. Here are the top recommendations by use case:

  • For 7B Models (Mistral, Phi-4, Llama 3.2) -- Budget: RTX 4060 (8 GB VRAM, ~$180-220). Runs any 7B model at Q4_K_M. Speed: 40-60 tok/sec. Tier: Budget enthusiasts.
  • For 14B Models (Llama 3.1, DeepSeek-R1) -- Mainstream: RTX 4070 Ti (12 GB VRAM, ~$500-600). Best price-to-performance. Llama 4 Scout 17B Q4 runs well. Speed: 85-120 tok/sec. Tier: Most popular.
  • For 14-24B Models (Qwen2.5, Mistral Small) -- Mid-Range: RTX 4080 or RTX 5080 (16 GB VRAM, ~$1000-1200). Runs Devstral Small 24B Q4_K_M. Speed: 110-140 tok/sec. Tier: Professional developers.
  • For 70B Models (Llama 3.3, Qwen 3.6) -- High-End: RTX 4090 (24 GB VRAM, ~$1700-2000). Runs 70B only at Q2_K (~24 GB, with noticeably degraded quality). For Q4_K_M (40 GB), use dual RTX 4090. Speed: 150-180 tok/sec single GPU. Tier: Research + production.
  • Best Value 2026: RTX 4070 Ti + RTX 5060 Ti 12GB combo (~$750 total) -- 24 GB combined, enough to split a 27-32B model at Q4_K_M across both cards.
  • For Apple Users: Mac M5 Max (128 GB unified memory), the first Mac to run 70B models at Q4_K_M. ~$6000. Comparable to a dual RTX 4090 setup.

| GPU | Best For | Price | Speed | Tier |
|---|---|---|---|---|
| RTX 4060 (8 GB) | 7B models | ~$180-220 | 40-60 tok/s | Budget |
| RTX 4070 Ti (12 GB) | 14B models | ~$500-600 | 85-120 tok/s | Mainstream |
| RTX 4080 / RTX 5080 (16 GB) | 14-24B models | ~$1000-1200 | 110-140 tok/s | Professional |
| RTX 4090 (24 GB) | 70B (Q2_K) | ~$1700-2000 | 150-180 tok/s | High-end |
| Dual RTX 4090 | 70B (Q4) | ~$3400-4000 | 280-360 tok/s | Enterprise |
| Mac M5 Max 128GB | 70B (Q4) | ~$6000 | 120-160 tok/s | Pro laptop |

How Do You Calculate VRAM Requirements?

VRAM requirements depend on three factors: model size (parameters), quantization (bits per weight), and inference mode. Use this formula to determine if your GPU has enough memory. For an interactive calculator, see the VRAM calculator for local LLMs.

Formula:

```text
VRAM (GB) = (Model Size × Quantization Bits) ÷ 8
```

Quantization values: FP16 = 16 bits, Q8_0 = 8 bits, Q5_K_M = 5 bits, Q4_K_M = 4 bits (nominal). The practical sweet spot is Q4_K_M -- a K-quant format that keeps the most sensitive tensors at higher precision, so it averages slightly more than 4 bits per weight and preserves more quality than the older Q4_0 format.

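| Model | FP16 | Q8_0 | Q5_K_M | Q4_K_M |
|---|---|---|---|---|
| Llama 4 Scout 17B (active) | ~34 GB | ~18 GB | ~12 GB | ~10 GB |
| Llama 3.1 8B | 16 GB | 8.5 GB | 5.7 GB | 4.7 GB |
| Qwen 3.6 27B | ~54 GB | ~28 GB | ~19 GB | ~16 GB |
| Qwen3 8B | ~16 GB | ~8.5 GB | ~5.7 GB | ~5 GB |
| Llama 3.3 70B | 140 GB | 70 GB | 48 GB | 40 GB |
| Qwen2.5 32B | 64 GB | 33 GB | 22 GB | 19 GB |
| Mistral Small 3.1 24B | 48 GB | 25 GB | 17 GB | 14 GB |
| Phi-4 Mini 3.8B | 7.6 GB | 4.1 GB | 2.7 GB | 2.3 GB |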

Q4_K_M is the recommended default for consumer hardware -- 90-95% of FP16 quality at 25-30% of the VRAM cost. Llama 4 Scout uses MoE architecture with 17B active parameters out of 109B total. The ~10 GB figure counts the active parameters only: every expert still has to be held in memory, so running Scout on a small GPU relies on keeping the inactive experts in system RAM.
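The weight formula above is simple enough to check by hand, but a small helper makes it easy to compare quantization levels. This is a minimal sketch: the function name `weight_vram_gb` and the `QUANT_BITS` mapping are illustrative, and the bit widths are the nominal values from the text, not the exact per-file averages.

```python
def weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM for model weights only, in GB."""
    return params_billion * bits_per_weight / 8


# Nominal bit widths from the text; real K-quant files run slightly larger.
QUANT_BITS = {"FP16": 16, "Q8_0": 8, "Q5_K_M": 5, "Q4_K_M": 4}

for name, bits in QUANT_BITS.items():
    print(f"Llama 3.3 70B at {name}: ~{weight_vram_gb(70, bits):.0f} GB")
```

The nominal 4-bit result for a 70B model is 35 GB, while the table lists 40 GB. The gap comes from K-quant formats storing some tensors at higher precision, so treat the formula as a lower bound.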

VRAM calculator showing the formula (Model Size × Bits) ÷ 8, with examples: 8B Q4_K_M = 4.7 GB, 13B Q5_K_M = 9.1 GB, 70B Q4_K_M = 40 GB. Q4_K_M is the recommended sweet spot for most hardware.

•KeyPoint: In one sentence: VRAM is the GPU's dedicated memory pool -- the single number that determines which AI models you can run locally and at what quality.

KV Cache: The Hidden VRAM Cost

The VRAM formula (Model Size × Bits ÷ 8) covers model weights only -- KV cache adds significant additional VRAM that most guides ignore.

The KV cache stores attention state for every token in your context window. It grows linearly with context length and stays in VRAM throughout the session.

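KV cache VRAM formula: `KV cache ≈ layers × kv_heads × head_dim × 2 × context_length × 2 bytes`. The first 2 covers keys plus values, the final 2 bytes assumes an FP16 cache, and `kv_heads` is the number of key/value heads, which is smaller than the attention head count on models that use grouped-query attention.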

| Model | 4K context | 32K context | 128K context |
|---|---|---|---|
| Llama 3.1 8B | 0.5 GB | 4 GB | 16 GB |
| Llama 3.3 70B | 2 GB | 16 GB | 64 GB |
| Qwen2.5 32B | 1 GB | 8 GB | 32 GB |
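To see where the Llama 3.1 8B row comes from, the KV cache formula can be evaluated directly. This is a sketch under stated assumptions: the helper name `kv_cache_gib` is illustrative, the cache is FP16 (2 bytes per value), and the architecture values (32 layers, 8 key/value heads, head dimension 128) are taken from the published Llama 3.1 8B configuration.

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB for one sequence at full context length."""
    # The factor 2 accounts for one key tensor plus one value tensor per layer.
    total_bytes = layers * kv_heads * head_dim * 2 * context_len * bytes_per_value
    return total_bytes / 1024**3


# Llama 3.1 8B: 32 layers, 8 KV heads (grouped-query attention), head_dim 128.
for ctx in (4096, 32768, 131072):
    print(f"{ctx:>6} tokens: {kv_cache_gib(32, 8, 128, ctx):.1f} GiB")
```

Other models will differ from the table's rounded figures depending on their layer and head counts, so plug in the values from the model's own config file before relying on the result.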

•KeyPoint: In one sentence: KV cache is temporary VRAM used to store conversation context -- it grows with every token you send or generate and is separate from model weight storage.

⚠️Warning: A Llama 3.1 8B at Q4_K_M needs 4.7 GB for weights -- but add a 32K context window and total VRAM rises to ~8.7 GB. On an 8 GB card, this causes OOM errors.

•KeyPoint: Rule of thumb: Add 25% to model weight size for typical 8K context, 100% for 32K context. Ollama default context is 2,048 tokens. To set higher: `PARAMETER num_ctx 32768` in your Modelfile.
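The rule of thumb can be turned into a quick fit check. This is a rough sketch, not a measurement: the function names are illustrative, the two anchor points (+25% at 8K, +100% at 32K) come from the text, and the linear interpolation between them is an assumption made here for convenience.

```python
def context_overhead(context_tokens: int) -> float:
    """Rule-of-thumb KV cache overhead as a fraction of weight size."""
    # Anchors from the text: +25% at 8K context, +100% at 32K context.
    # Interpolating linearly between them is an assumption of this sketch.
    if context_tokens <= 8192:
        return 0.25
    if context_tokens >= 32768:
        return 1.00
    span = (context_tokens - 8192) / (32768 - 8192)
    return 0.25 + span * 0.75


def fits(weights_gb: float, context_tokens: int, vram_gb: float):
    """Return (fits?, estimated total GB) for weights plus context overhead."""
    total = weights_gb * (1 + context_overhead(context_tokens))
    return total <= vram_gb, total


ok, total = fits(4.7, 32768, 8)   # Llama 3.1 8B Q4_K_M on an 8 GB card
print(f"32K context: {total:.1f} GB needed, fits: {ok}")
```

For the 8 GB example in the warning above, the estimate lands above the card's capacity at 32K context and comfortably below it at 8K, which matches the OOM behaviour described.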

Which GPU Tier Matches Your Workload?

As of May 2026, NVIDIA GPUs deliver the highest tokens/sec for local LLM inference across all price points. The sections below give specific model recommendations for each tier. For a detailed benchmark comparison, see the best GPUs for local LLM guide.

| Tier | GPU | VRAM | Best For | Speed |
|---|---|---|---|---|
| Budget ($600) | RTX 4070 Ti / RTX 5070 | 12 GB | 7-13B models | ~80 tok/s |
| Mid ($900) | RTX 5070 Ti | 16 GB | 13-30B models | ~100 tok/s |
| High ($1,200) | RTX 4080 / RTX 5080 | 16 GB | 13-30B models | ~120 tok/s |
| Top ($1,800) | RTX 4090 | 24 GB | 32B models, 70B at Q2_K | ~150 tok/s |
| Latest ($2,000) | RTX 5090 | 32 GB | 70B + headroom | ~200 tok/s |
| Server ($3,000+) | RTX 6000 Ada / A100 | 48+ GB | Multi-user, 70B+ | Production |
| Desktop AI ($3,999) | NVIDIA DGX Spark | 128 GB | Any model, unified | 18-28 tok/s |

GPU tier recommendations: $600 RTX 4070 Ti (12GB, 7-13B models, 80 tok/s), $1,200 RTX 4080 (16GB, 13-30B, 120 tok/s), $1,800 RTX 4090 (24GB, 70B, 150 tok/s), $2,000 RTX 5090 (32GB, 70B+, 200 tok/s), $3,999 DGX Spark (128GB, any model). GPU choice matters 10× more than CPU.

•KeyPoint: As of May 2026, the RTX 50-series (Blackwell) is the current generation. RTX 5090 (32 GB) is future-proof for 70B models. RTX 4090 remains excellent value for existing buyers.

Best Local LLMs by VRAM Tier (May 2026)

Use this as a quick lookup by your GPU's VRAM tier:

All models listed below are open-weights -- downloadable, fine-tunable, and free to run locally. If you're choosing between open-weights and proprietary APIs, see our open-source vs proprietary LLMs comparison for cost and performance trade-offs at different token volumes.

Hardware determines which models you can run; prompt engineering determines how well they perform. A well-structured prompt on a 7B model often outperforms a lazy prompt on a 70B model. See the complete prompt engineering guide for techniques that maximise output quality at any parameter count.

  • 8 GB VRAM (RTX 4060, RTX 5060 Ti): Llama 3.1 8B Q4_K_M (4.7 GB, ~70 tok/s) -- recommended. Qwen3 8B (5 GB, best multilingual + coding). Phi-4 Mini 3.8B (2.3 GB, fastest). Gemma 2 9B (5.5 GB, fits with care). Avoid 13B+ models.
  • 12 GB VRAM (RTX 4070 Ti, RTX 5070, Intel Arc B580, Intel B770): Llama 4 Scout 17B Q4_K_M (~10 GB, best overall quality, MoE). Llama 3.1 8B (4.7 GB, fast with headroom). Qwen2.5 14B Q4_K_M (8.5 GB, better reasoning on budget). DeepSeek-R1 8B (5 GB, best reasoning). Avoid 30B+.
  • 16 GB VRAM (RTX 4080, RTX 5070 Ti, RTX 5080): Mistral Small 3.1 24B Q4_K_M (14 GB, best quality at tier). Devstral Small 24B Q4_K_M (~16 GB) for agentic coding. Qwen2.5 14B (9 GB, fast with context headroom). Llama 3.3 70B at Q2_K (17 GB, possible but degraded quality).
  • 24 GB VRAM (RTX 4090): Qwen 3.6 27B Q4_K_M (~16 GB, 77.2% SWE-bench, best dense coding model). DeepSeek-R1 32B Q4_K_M (~19 GB, best reasoning). Qwen2.5 32B Q5_K_M (~21 GB). Llama 3.3 70B needs 2× 24 GB GPUs at Q4_K_M.
  • 32 GB VRAM (RTX 5090): Llama 3.3 70B Q4_K_M (40 GB -- needs minimal CPU offload for last layers). Kimi K2.6 quantized (MoE, 42B active, MIT license, best coding). Qwen2.5 32B (19 GB, fits entirely with 13 GB spare). RTX 5090 is the first single consumer GPU that fits 70B with minimal offload.
  • 48+ GB VRAM (RTX 6000 Ada, A100, DGX Spark): Llama 3.3 70B Q4_K_M (40 GB, fits entirely). Llama 4 Maverick (17B active, 400B total, MoE). Llama 3.3 70B Q8_0 (70 GB -- needs 80 GB A100). NVIDIA DGX Spark (128 GB unified) fits every open-weight model including 70B at Q8_0 with 58 GB to spare.
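The tier list can also be expressed as a lookup: given a VRAM budget, return the models whose weights fit with some headroom. This is a minimal sketch. The sizes are copied from the Q4_K_M column of the VRAM table earlier in this guide, the names `Q4_K_M_GB` and `models_that_fit` are illustrative, and the default 25% headroom is the buffer this guide recommends rather than a hard requirement.

```python
# Q4_K_M weight sizes in GB, taken from the VRAM table in this guide.
Q4_K_M_GB = {
    "Phi-4 Mini 3.8B": 2.3,
    "Llama 3.1 8B": 4.7,
    "Qwen3 8B": 5.0,
    "Mistral Small 3.1 24B": 14.0,
    "Qwen 3.6 27B": 16.0,
    "Qwen2.5 32B": 19.0,
    "Llama 3.3 70B": 40.0,
}


def models_that_fit(vram_gb: float, headroom: float = 0.25) -> list[str]:
    """Return models whose weights plus headroom fit in VRAM, largest first."""
    fitting = [
        (size, name)
        for name, size in Q4_K_M_GB.items()
        if size * (1 + headroom) <= vram_gb
    ]
    return [name for size, name in sorted(fitting, reverse=True)]


print(models_that_fit(12))
print(models_that_fit(24))
```

With the 25% buffer the lookup is more conservative than the tier list above. Passing `headroom=0` reproduces the tighter fits, at the cost of leaving little room for context.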

Which Local LLMs Run Best on 16 GB VRAM?

On a 16 GB VRAM GPU (NVIDIA RTX 4080, RTX 5080, or RTX 4090 laptop), the practical ceiling is 14-24B models. Mistral Small 3.1 24B at Q4_K_M is the best overall choice: it uses 13 GB VRAM, runs at 55 tok/sec, and is EU-origin with Apache 2.0 license.

Devstral Small 24B Q4_K_M fits at ~16 GB and is optimized for agentic coding workflows. The table below shows which models fit and which do not. "Does NOT fit" rows are included intentionally -- this is the most common mistake 16 GB owners make.

| Model | Quantization | VRAM Used | Speed (RTX 4080) | Best For | Fits 16 GB? |
|---|---|---|---|---|---|
| Mistral Small 3.1 24B | Q4_K_M | ~13 GB | 55 tok/sec | General chat | ✅ Yes |
| Devstral Small 24B | Q4_K_M | ~16 GB | 45 tok/sec | Agentic coding | ✅ Tight |
| Qwen2.5 14B | Q8_0 | ~15 GB | 45 tok/sec | Coding + reasoning | ✅ Yes |
| DeepSeek-R1 14B | Q8_0 | ~15 GB | 40 tok/sec | Math + analysis | ✅ Yes |
| Llama 3.1 8B | FP16 | ~16 GB | 70 tok/sec | Fastest responses | ✅ Tight |
| Llama 3.3 70B | Q4_K_M | ~39 GB | -- | -- | ❌ No (needs 39 GB) |

Bar chart showing which models fit in 16 GB VRAM: Mistral Small 3.1 24B Q4_K_M (13 GB ✅), Devstral Small 24B Q4_K_M (16 GB ✅), Qwen2.5 14B Q8_0 (15 GB ✅), Llama 3.3 70B Q4_K_M (39 GB ❌). Best choice: Mistral Small 3.1 24B for 55 tok/sec.

•ProTip: 🏆 Best overall for 16 GB: Mistral Small 3.1 24B Q4_K_M at ~13 GB, 55 tok/sec. For agentic coding, use Devstral Small 24B (Mistral AI, France) at 45 tok/sec. Best reasoning: DeepSeek-R1 14B Q8_0 at 40 tok/sec.

⚠️Warning: RTX 4090 laptop GPUs have 16 GB VRAM (not 24 GB). They share the same model ceiling as the RTX 4080 desktop.

•KeyPoint: When to upgrade to 24 GB (RTX 4090 desktop): only if you need 27-32B models at Q4-Q5, or want to run two models simultaneously without reloading.

Which Local LLMs Run Best on 12 GB VRAM?

On a 12 GB VRAM GPU (NVIDIA RTX 4070 Ti, RTX 5070, or RTX 5060 Ti), you can run 7-8B models at Q8, 14B at Q4_K_M, or the new Llama 4 Scout 17B at Q4_K_M (MoE). Llama 4 Scout uses a Mixture-of-Experts architecture with 17B active parameters out of 109B total -- this makes Scout much cheaper to run per token than its parameter count suggests, although the inactive experts still have to be stored, typically in system RAM.

Llama 3.1 8B at Q8_0 is the most reliable choice for conservative setups: 9 GB VRAM, 80 tok/sec, and full instruction-following quality. Qwen2.5 14B at Q4_K_M also fits at ~8.5 GB and delivers notably better reasoning than the 8B tier.

| Model | Quantization | VRAM Used | Speed (RTX 4070 Ti) | Best For | Fits 12 GB? |
|---|---|---|---|---|---|
| Llama 4 Scout 17B | Q4_K_M | ~10 GB | ~65 tok/sec | Best overall (MoE) | ✅ Yes |
| Llama 3.1 8B | Q8_0 | ~9 GB | 80 tok/sec | General chat + coding | ✅ Yes |
| Qwen2.5 14B | Q4_K_M | ~8.5 GB | 65 tok/sec | Better reasoning on budget | ✅ Yes |
| Llama 3.2 11B Vision | Q5_K_M | ~8 GB | 65 tok/sec | Image + text tasks | ✅ Yes |
| Qwen3 8B | Q8_0 | ~8 GB | 85 tok/sec | Best multilingual + coding | ✅ Yes |
| Mistral 7B v0.3 | FP16 | ~14 GB | -- | -- | ❌ No (needs 14 GB at FP16) |

•ProTip: 🏆 Best overall for 12 GB: Llama 4 Scout 17B Q4_K_M at ~10 GB. MoE architecture means 17B active params with 109B total -- better quality than any dense 8B model at similar VRAM cost. If you prefer dense models, use Llama 3.1 8B Q8_0 at ~9 GB.

•KeyPoint: RTX 3060 12GB is the budget entry point (~$200 used). It runs all 12 GB models but at ~60-70 tok/sec vs ~80-90 tok/sec on RTX 4070 Ti due to its lower memory bandwidth.

Which 70B Models Actually Fit in 24 GB VRAM (RTX 4090)?

The RTX 4090 has 24 GB VRAM -- not enough for most 70B models at acceptable quality. Llama 3.3 70B at Q4_K_M requires approximately 39 GB. The common misconception is that "Q4 is small" -- at 70B parameters, even Q4 is large.

On a single RTX 4090, the best strategy is 27-32B models, which deliver strong quality and fit comfortably. Qwen 3.6 27B at Q4_K_M is the best dense coding model (77.2% SWE-bench). For true 70B at Q4+, you need 2× RTX 4090 or a 48 GB server GPU. See how to run 70B models on 24 GB VRAM for advanced techniques.

| Model | Quantization | VRAM Required | Fits 24 GB? | Speed (RTX 4090) | Notes |
|---|---|---|---|---|---|
| Qwen 3.6 27B | Q4_K_M | ~16 GB | ✅ Yes | 55 tok/sec | Best dense coding model, 77.2% SWE-bench |
| DeepSeek-R1 32B | Q4_K_M | ~19 GB | ✅ Yes | 60 tok/sec | Best reasoning, strong overall quality |
| Qwen2.5 32B | Q5_K_M | ~21 GB | ✅ Yes | 55 tok/sec | High quality, excellent coding + instruction |
| Qwen2.5 32B | Q8_0 | ~34 GB | ❌ No | -- | Requires 48 GB GPU |
| Llama 3.3 70B | Q2_K | ~24 GB | ⚠️ Barely | 30 tok/sec | Fits but Q2 quality is noticeably degraded |
| Llama 3.3 70B | Q4_K_M | ~39 GB | ❌ No | -- | Needs 2× RTX 4090 or A100 80 GB |

VRAM requirements vs RTX 4090 24 GB limit: Qwen 3.6 27B Q4_K_M (16 GB ✅), DeepSeek-R1 32B Q4_K_M (19 GB ✅), Qwen2.5 32B Q5_K_M (21 GB ✅), Llama 3.3 70B Q4_K_M (39 GB ❌ -- exceeds 24 GB by 63%). Sweet spot: 27-32B models at Q4-Q5.

•KeyPoint: 🏆 Best for RTX 4090 (24 GB): Qwen 3.6 27B Q4_K_M (~16 GB, 77.2% SWE-bench) for best dense coding model. For reasoning: DeepSeek-R1 32B Q4_K_M (~19 GB, 60 tok/sec). Better than Llama 3.3 70B Q2_K at far less VRAM.

⚠️Warning: If you specifically need 70B quality at Q4+, the RTX 4090 is not the right GPU. You need 2× RTX 4090 (48 GB combined via tensor parallelism) or an RTX 6000 Ada (48 GB). Running 70B at Q2_K on a single 4090 noticeably hurts output quality.

What CPU and RAM Do You Need?

With a dedicated GPU, CPU and RAM are secondary components. The GPU handles matrix math; CPU/RAM manage context preparation. For a full comparison of GPU vs CPU vs Apple Silicon inference speeds, see the GPU vs CPU vs Apple Silicon guide.

Minimum CPU: 8-core processor (Intel Core i7 14th gen, AMD Ryzen 7 7700X, or newer). Older CPUs add 20%+ latency.

RAM: 16 GB minimum (with GPU). If running without GPU, 32+ GB recommended. RAM does not directly limit model size when GPU is present.

Storage: 500 GB SSD for model files and OS. M.2 NVMe is preferred (faster model loading).

Which Models Run Well on 16 GB System RAM Without a GPU?

Without a GPU, a machine with 16 GB system RAM can run 3B-7B models at 8-20 tokens/sec using CPU inference. The bottleneck is memory bandwidth, not RAM capacity -- CPUs have far lower bandwidth than GPUs, which is why inference is 5-10× slower.
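The bandwidth argument can be made concrete with a back-of-envelope bound. This sketch assumes that generating one token reads every weight once, so the ceiling on tokens/sec is roughly memory bandwidth divided by model size. It ignores KV cache traffic and compute limits, and the bandwidth figures used (about 83 GB/s for dual-channel DDR5-5200, about 504 GB/s for an RTX 4070 Ti) are theoretical peaks chosen for illustration.

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on generation speed if each token reads all weights once."""
    return bandwidth_gb_s / model_gb


# Illustrative peak bandwidths; measured throughput will be lower.
cpu_bound = max_tokens_per_sec(83.2, 4.9)    # dual-channel DDR5-5200, 8B Q4_K_M
gpu_bound = max_tokens_per_sec(504.0, 4.7)   # RTX 4070 Ti, 8B Q4_K_M

print(f"CPU ceiling: ~{cpu_bound:.0f} tok/s")
print(f"GPU ceiling: ~{gpu_bound:.0f} tok/s")
print(f"Ratio: ~{gpu_bound / cpu_bound:.1f}x")
```

The ratio between the two ceilings falls inside the 5-10× range quoted above, which is why adding RAM capacity does little for speed while adding a GPU does a lot.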

On 16 GB system RAM, the practical rule is: model file size + 4 GB OS overhead ≤ 16 GB. A 7B model at Q4_K_M (4.9 GB) fits, but leaves little headroom for long contexts. The table below shows realistic options as of May 2026.
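The RAM rule is a one-line inequality, shown here as a small check. This is a sketch: the function names are illustrative, and the 4 GB OS reserve is the figure from the rule above, which will vary with what else is running.

```python
OS_OVERHEAD_GB = 4  # OS reserve used by the rule in the text


def fits_in_ram(model_file_gb: float, system_ram_gb: float = 16) -> bool:
    """True if the model file plus the OS reserve fits in system RAM."""
    return model_file_gb + OS_OVERHEAD_GB <= system_ram_gb


def context_headroom_gb(model_file_gb: float, system_ram_gb: float = 16) -> float:
    """RAM left for KV cache and other applications after model and OS."""
    return system_ram_gb - OS_OVERHEAD_GB - model_file_gb


print(fits_in_ram(4.9), round(context_headroom_gb(4.9), 1))
print(fits_in_ram(9.0), round(context_headroom_gb(9.0), 1))
```

A model that passes the check can still run short of memory once a browser and a long context are added, so the headroom number is the more useful of the two outputs.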

For a complete speed-optimized model guide covering CPU-only, 4 GB, 6 GB, and 8 GB VRAM tiers with real benchmarks, see **Fastest Local LLMs for Low-End PCs**.

| Model | Quantization | RAM Used | Speed (Ryzen 9 7950X) | Best For | Notes |
|---|---|---|---|---|---|
| Gemma 2 2B | Q8_0 | ~2.7 GB | 28 tok/sec | Fastest, minimal RAM | Leaves 13 GB free for OS |
| Phi-4 Mini 3.8B | Q4_K_M | ~2.5 GB | 25 tok/sec | Coding on CPU | Best quality-per-RAM ratio |
| Llama 3.2 3B | Q8_0 | ~3.8 GB | 20 tok/sec | General chat, low RAM | Reliable, widely supported |
| Llama 3.1 8B | Q4_K_M | ~4.9 GB | 12 tok/sec | Best CPU quality | 12 tok/sec is slow but usable for batch tasks |
| Llama 3.1 8B | Q8_0 | ~9 GB | 8 tok/sec | Max quality on CPU | Too slow for interactive use on most CPUs |

CPU-only inference speeds on Ryzen 9 7950X: Gemma 2 2B Q8_0 (28 tok/sec fastest), Phi-4 Mini Q4_K_M (25 tok/sec best choice), Llama 3.1 8B Q8_0 (8 tok/sec). A used RTX 3060 ($200) is 5-8× faster.

•ProTip: 🏆 Best for 16 GB RAM, no GPU: Phi-4 Mini 3.8B Q4_K_M (2.5 GB, 25 tok/sec). Delivers surprisingly strong coding and reasoning for its size.

•KeyPoint: CPU vs GPU speed reality: A used NVIDIA RTX 3060 12 GB (~$200) runs Llama 3.1 8B at 70+ tok/sec -- 5-8× faster than the Ryzen 9 7950X at CPU-only inference. If speed matters, buy a GPU before adding RAM.

⚠️Warning: Running a 7B model on 16 GB RAM with CPU-only leaves fewer than 7 GB for the OS and browser. With long conversation contexts (32k+ tokens), the model file grows beyond its base size and can cause RAM exhaustion. Keep context size under 4096 on 16 GB CPU-only machines.

How Much Storage Do You Need?

Model files are large: a 7B model at 4-bit quantization is 4-5 GB. Plan storage around the number and size of models you want to keep locally.

  • 500 GB SSD: OS + 1-2 small models (3B, 7B)
  • 1 TB SSD: OS + 3-5 models (mix of 7B and 13B)
  • 2 TB SSD: OS + 10+ models (various sizes)
  • 4 TB NVMe RAID: Production setup, fast model loading

What Hardware Build Should You Buy?

Building a local LLM machine from scratch means prioritizing GPU first, then CPU and RAM. Here are three realistic configurations. For multi-GPU builds, see the multi-GPU local LLM guide.

| Budget | GPU | CPU | RAM | Models | Verdict |
|---|---|---|---|---|---|
| $1500 (entry) | RTX 4070 Ti | i7 13700 | 16 GB | 7-13B | Realistic |
| $2500 (solid) | RTX 4080 | i7 14700K | 32 GB | 13-30B | Recommended |
| $4000 (high-end) | 2× RTX 4090 | Ryzen 9 7950X | 128 GB | Any (70B+) | Overkill for personal |

Three build configurations: $1500 entry-level (RTX 4070 Ti, i7 13700, 16GB) for 7-13B models, $2500 solid build (RTX 4080, i7 14700K, 32GB) for 13-30B, $4000 high-end (2× RTX 4090, Ryzen 9, 128GB) for any model. Mid-level offers best value.

What If You Can't Afford the Hardware?

If a $250-400 GPU is outside your budget, or your laptop is too old to support modern inference engines, local LLMs may not be cost-effective for you in 2026.

Calculate the real cost:

- Local: $800-2,000 upfront hardware + electricity + maintenance over 2-3 years

- Cloud: $5-50/month for typical developer use (Llama API or GPT-4o mini)

For light users (< 100,000 tokens/month), cloud APIs cost $5-10/month and require zero hardware. For heavy users (> 10M tokens/month), local breaks even in 6-12 months.
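The break-even point is simple arithmetic: divide the upfront hardware cost by the monthly saving over cloud. The sketch below is illustrative only. The function name and all dollar figures in the examples are hypothetical inputs, and the local running cost stands in for electricity and maintenance.

```python
import math
from typing import Optional


def break_even_months(hardware_usd: float,
                      cloud_usd_per_month: float,
                      local_running_usd_per_month: float = 0.0) -> Optional[int]:
    """Months until local hardware pays for itself, or None if it never does."""
    saving = cloud_usd_per_month - local_running_usd_per_month
    if saving <= 0:
        return None  # cloud is cheaper every month, so there is no break-even
    return math.ceil(hardware_usd / saving)


# Hypothetical inputs for illustration only.
print(break_even_months(800, 100, 20))   # heavy user
print(break_even_months(2000, 10, 15))   # light user
```

For a light user whose cloud bill is lower than the local running cost, the function returns `None`, which matches the point above that cloud can simply be the cheaper option.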

Compare full local vs cloud cost and performance trade-offs to find your break-even point. Many developers discover cloud is cheaper for their actual usage pattern.

Already shopping below the recommended VRAM tiers? See Best Local AI App for a Low-End PC for which model and app combinations actually run on 8 GB or less.

How Do You Maximize llama.cpp Speed on RTX 4070 Ti?

With correct settings, llama.cpp on an RTX 4070 Ti achieves 85-95 tokens/sec on Llama 3.1 8B Q4_K_M -- more than double the default out-of-box speed. The single most impactful flag is `--n-gpu-layers 99`, which offloads all model layers to the GPU. Without it, layers fall back to CPU, creating a severe bottleneck.

These settings apply to llama.cpp directly and to Ollama (which uses llama.cpp internally). Ollama sets `--n-gpu-layers 99` automatically on NVIDIA hardware if drivers are installed correctly.

  • Q4_K_M beats Q4_0 by 15-20% on RTX 4070 Ti. The K_M variant uses mixed quantization that NVIDIA tensor cores accelerate more efficiently. Always choose Q4_K_M over Q4_0 when both are available.
  • IQ4_XS is the smallest format (~8% smaller than Q4_K_M) with minimal quality loss. Useful for fitting Qwen2.5 14B into 12 GB VRAM when Q4_K_M is borderline.
  • Q5_K_M runs at nearly the same speed as Q4_K_M on NVIDIA GPUs (< 5% slower) while providing noticeably better output quality. Worth using when you have 20% VRAM headroom.
| Flag | What It Does | Impact | Default | Notes |
|---|---|---|---|---|
| `--n-gpu-layers 99` | Offloads all layers to GPU | +100-150% speed | 0 (CPU only) | Most important flag -- always set this first |
| `--threads [cores]` | CPU threads for prompt processing | +10-15% speed | All threads (including HT) | Set to physical core count only. Hyperthreading hurts inference. |
| `--ctx-size 2048` | KV cache / context window size | Saves 0.5-8 GB VRAM | 4096 | 2048 = ~0.5 GB extra VRAM. 32768 = ~8 GB extra. Only increase if needed. |
| `--batch-size 512` | Prompt processing batch size | +5-10% throughput | 512 | Good default. Increase to 1024 for batch workloads if VRAM allows. |
| `--flash-attn` | Flash Attention 2 kernel | -20-30% VRAM at long ctx | Disabled | Available since llama.cpp b2900. Reduces VRAM for contexts > 8k tokens. |

Default llama.cpp config: ~40 tok/sec. Optimized (--n-gpu-layers 99 + --ctx-size 2048 + --flash-attn): ~90 tok/sec -- a 125% speed improvement on RTX 4070 Ti running Llama 3.1 8B Q4_K_M.
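The flags above can be assembled into a launch command programmatically, which helps when scripting benchmarks across settings. This is a sketch under assumptions: it presumes a recent llama.cpp build whose CLI binary is named `llama-cli`, the model path is a hypothetical placeholder, and `--batch-size` is the llama.cpp spelling of the batch flag. The command is only built and printed here, not executed.

```python
import shlex


def llama_cli_command(model_path: str, physical_cores: int,
                      ctx_size: int = 2048, gpu_layers: int = 99,
                      batch_size: int = 512) -> list[str]:
    """Build an argument list for llama.cpp with the speed settings above."""
    return [
        "llama-cli",
        "--model", model_path,
        "--n-gpu-layers", str(gpu_layers),   # offload every layer to the GPU
        "--threads", str(physical_cores),    # physical cores, not hyperthreads
        "--ctx-size", str(ctx_size),         # smaller context, smaller KV cache
        "--batch-size", str(batch_size),
    ]


# Hypothetical model path for illustration.
cmd = llama_cli_command("models/llama-3.1-8b-q4_k_m.gguf", physical_cores=8)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` would launch the model. Keeping the arguments as a list avoids shell quoting problems with paths that contain spaces.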

•ProTip: Run `ollama ps` to confirm your model is loaded on GPU. If GPU utilization shows 0% in `nvidia-smi` while generating, drivers are not correctly routing to CUDA. Reinstall NVIDIA CUDA Toolkit and restart Ollama.
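The `nvidia-smi` check can be scripted so it runs alongside a generation job. The sketch below uses the `--query-gpu` and `--format=csv,noheader,nounits` options of `nvidia-smi`. The function names are illustrative, and the parser assumes plain integer fields, which may not hold on GPUs that report `[N/A]` for a value.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> list[dict]:
    """Parse nvidia-smi CSV output into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field.strip()) for field in line.split(","))
        stats.append({"util_pct": util, "used_mib": used, "total_mib": total})
    return stats


def read_gpu_stats() -> list[dict]:
    """Run nvidia-smi and return parsed stats (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


# Illustrative sample line, not captured from a real machine.
print(parse_gpu_stats("97, 6120, 12282\n"))
```

Calling `read_gpu_stats()` while a model is generating should show non-zero utilization and several GB of memory in use. A reading of 0% points to the driver problem described above.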

•KeyPoint: RTX 4070 Ti speed reference: Llama 3.1 8B Q4_K_M = 85-95 tok/sec. Qwen2.5 14B Q4_K_M = 60-70 tok/sec. Qwen2.5 7B Q8_0 = 90-95 tok/sec. These assume `--n-gpu-layers 99` and `--ctx-size 2048`.

⚠️Warning: Increasing --ctx-size beyond 8192 on a 12 GB GPU will cause model layer offloading back to CPU if the KV cache exhausts remaining VRAM. If speed drops suddenly on long conversations, reduce context size or use --flash-attn.

Can Mac Hardware Run Local LLMs?

Apple Silicon (M-series) runs local LLMs efficiently using unified memory shared between CPU and GPU. The M5 generation, introduced in October 2025, offers a significant upgrade for local inference: Apple claims 4× faster LLM prompt processing vs M4.

The M5 Max with 128 GB unified memory is the first Apple Silicon chip that comfortably runs 70B models at Q4_K_M -- comparable to dual RTX 4090 desktops but in a laptop or Mac Studio form factor. The M5 Pro with 64 GB unified memory handles 32B models with generous headroom for KV cache and multitasking.

| Mac | GPU Memory | Best For | Notes |
|---|---|---|---|
| M3 MacBook Pro 16" | 18 GB unified | 7B models (fast) | Can run 13B slowly |
| M4 Max | 48-96 GB unified | 13-30B models | Not optimized for 70B |
| M5 Pro (MacBook Pro) | 64 GB unified, 307 GB/s | 30B models comfortably | Llama 4 Scout runs well |
| M5 Max (MacBook Pro / Studio) | 128 GB unified, 460-614 GB/s | 70B models at Q4_K_M | First Mac to fit 70B properly |

Mac hardware comparison: M3 MacBook Pro 16" (18GB, 7B), M4 Max (48-96GB, 13-30B), M5 Pro (64GB, 30B), M5 Max (128GB, 70B at Q4_K_M). M5 Max is first Mac to handle 70B models comparable to dual RTX 4090 desktops.

When Should You Use Server vs Consumer Hardware?

For production deployment (24/7 operation, multiple users), server-grade hardware is recommended over consumer GPUs. Consumer hardware is optimized for gaming, not sustained inference.

  • Consumer (RTX 4090): ~$1800, 24 GB VRAM, single-user, prone to thermal throttling under sustained load.
  • Server (RTX 6000 Ada): ~$5000, 48 GB VRAM, designed for 24/7 use, better cooling, error correction.
  • Recommendation: Start with RTX 4090. If running 70B models 24/7 for multiple users, upgrade to dual A100 or RTX 6000.

Consumer vs server hardware: RTX 4090 ($1800, 24GB, single-user, part-time) vs RTX 6000 Ada ($5000+, 48GB, multi-user, 24/7 duty). Start with consumer hardware; upgrade to server-grade only if running production services.

NVIDIA DGX Spark: 128 GB Desktop AI Computer

The NVIDIA DGX Spark ($3,999) is one of the few compact desktops as of May 2026 that fits Llama 3.3 70B at Q8_0 entirely in unified memory.

Built on the GB10 Grace Blackwell Superchip, the DGX Spark launched in late 2025 as a compact desktop AI computer with 128 GB LPDDR5x unified memory. As of May 2026, the DGX Spark also runs Llama 4 Scout and Maverick entirely in memory, as well as Kimi K2.6 (quantized), without needing a multi-GPU setup.

| Spec | Value |
|---|---|
| Unified memory | 128 GB LPDDR5x |
| Llama 3.3 70B at Q4_K_M | ✅ fits (40 GB) |
| Llama 3.3 70B at Q8_0 | ✅ fits (70 GB) |
| Inference speed (70B) | 18-28 tok/s |
| Price | $3,999 |
| OS | DGX OS (Ubuntu), Ollama pre-installed |
| vs RTX 4090 | About 5× the memory at roughly 2.2× the price |

•KeyPoint: Compared to 2× RTX 4090 (48 GB total, ~$3,600): DGX Spark has 2.7× more memory at a $400 premium, but its LPDDR5x memory is slower than the RTX 4090's GDDR6X. The RTX 4090 pair is better value unless you specifically need 70B at Q8_0 quality.

What Are the Most Common Hardware Mistakes?

  • Buying CPU-only when GPU is available. A $600 RTX 4070 Ti will outperform a $2000 CPU. GPU dominates LLM speed.
  • Not accounting for VRAM overhead. Model file size + system overhead + context = total VRAM used. Always buy 25% more than the model size.
  • Assuming 70B models fit comfortably in 40 GB VRAM. Q4_K_M weights alone take about 40 GB, which leaves no room for KV cache. Q5_K_M needs about 48 GB.
  • Ignoring power supply and cooling. The RTX 4090 is rated at 450 W and the RTX 5090 at 575 W. Plan for an 850 W PSU or larger (1000 W for the RTX 5090) and good case airflow.
  • Thinking an old GPU will work. RTX 2080 is 10× slower than RTX 4070 Ti. Modern GPU architecture significantly outperforms prior generations.
  • Not accounting for KV cache VRAM on top of model weights: A 7B model at Q4_K_M is 4.7 GB of weights -- but with a 32K context window, the KV cache adds ~4 GB more, totalling ~8.7 GB. On an 8 GB card this causes OOM errors. Always add 25-100% to model size depending on context length.
  • Treating hardware cost as the only cost: If you cannot afford 16+ GB RAM or a dedicated GPU, cloud APIs cost less for low-volume use ($0.01-0.05 per 1K tokens). See Local LLM vs Cloud: Cost Analysis for the full trade-off.

What Regional Compliance Rules Apply to Local LLM Hardware?

EU (GDPR + EU AI Act): Running LLMs locally keeps all inference data within your infrastructure, eliminating cross-border data transfer concerns under GDPR Article 44. As of May 2026, EU enterprises deploying LLMs for customer data processing must ensure models never phone home -- local hardware removes this risk entirely. EU AI Act high-risk system obligations apply from August 2, 2026 (pending Digital Omnibus which may delay to December 2027). Local hardware satisfies data residency requirements by default.

Japan (APPI): Japan's Act on the Protection of Personal Information (APPI) revision (2022) requires data minimization for AI processing. On-premises LLM hardware such as an RTX 4090 workstation satisfies this requirement for document processing and customer support automation.

China: The Cyberspace Administration of China (CAC) Generative AI Regulations (2023) require domestically-deployed AI models to undergo registration. Running local hardware with open-weight models avoids API-based compliance exposure for internal enterprise use.

Common Questions About Local LLM Hardware

Can I run a 70B model on a laptop?

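On most laptops, only with heavy quantization (Q2, 2-bit) and CPU fallback, which is impractical. The exception is a MacBook Pro with 128 GB unified memory, which fits 70B at Q4_K_M. Otherwise laptops are suited for 7B models; for 70B, use a desktop with dual RTX 4090s or a 48 GB GPU.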

Is RTX 4090 overkill for personal use?

Not if you run 27-32B models or multiple models simultaneously. For just 7B chat, RTX 4070 Ti suffices. RTX 4090 is future-proof if you want flexibility.

Should I buy RTX 5090 or wait for RTX 6090?

RTX 5090 is available (early 2026). RTX 6000 Ada server GPUs are also solid. Unless you have unlimited budget, RTX 5090 or 4090 are excellent.

How does quantization affect quality?

FP16 = 100% quality (baseline), Q8 = 99%, Q5 = 95%, Q4 = 90-95%. For most tasks, Q4 is indistinguishable from FP16.

Can I upgrade GPU later?

Yes. Start with RTX 4070 Ti now, upgrade to RTX 5090 in 2 years if needed. GPU is the most replaceable component.

How much RAM do I need to run a 7B model locally?

8 GB RAM is the absolute minimum for a 7B model. 16 GB is recommended for comfortable use alongside browser and OS. 32 GB gives headroom for larger context windows and multitasking.

Can I run local LLMs on Apple Silicon (M1/M2/M3/M4/M5)?

Yes. Apple Silicon uses unified memory shared between CPU and GPU. M5 Pro (64 GB, 307 GB/s) runs 30B models well. M5 Max (128 GB, 460-614 GB/s) is the first Mac to run 70B at Q4_K_M -- comparable to dual RTX 4090 desktops.

What CPU is best for local LLMs without a GPU?

High-core-count CPUs with large L3 cache: AMD Ryzen 9 7950X or Intel Core i9-14900K. Expect 5-15 tokens/sec for 7B models. CPU inference is 5-10× slower than GPU.

Does storage speed affect local LLM performance?

Yes, at model load time. NVMe SSD (3-7 GB/s) loads a 7B model in 2-5 seconds vs. 20-60 seconds on HDD. Inference speed after loading is unaffected by storage.

Can I use multiple GPUs to run larger models?

Yes. Two RTX 4090s (24 GB each) provide 48 GB of combined VRAM, enough for 70B models at Q4_K_M (~40 GB). llama.cpp splits the model across cards (controlled with `--split-mode` and `--tensor-split`), and Ollama distributes layers across available GPUs automatically.

What are the best local LLMs for 16 GB VRAM in 2026?

Mistral Small 3.1 24B Q4_K_M (13 GB, 55 tok/sec) is the best overall for RTX 4080 / RTX 5080 / RTX 4090 laptop. For agentic coding: Devstral Small 24B Q4_K_M (16 GB, 45 tok/sec). For reasoning: DeepSeek-R1 14B (15 GB, 40 tok/sec). Llama 3.3 70B does not fit -- it requires 39 GB at Q4_K_M.

Can a single RTX 4090 run a 70B model at good quality?

No -- not at Q4_K_M quality. Llama 3.3 70B at Q4_K_M requires ~39 GB VRAM. The RTX 4090 has 24 GB. You can run it at Q2_K (~24 GB) but quality drops noticeably. Better options: Qwen 3.6 27B Q4_K_M (~16 GB, 77.2% SWE-bench, best dense coding) or DeepSeek-R1 32B Q4_K_M (~19 GB, best reasoning).

What is the best local LLM for 16 GB system RAM without a GPU?

Phi-4 Mini 3.8B Q4_K_M (2.5 GB RAM, ~25 tok/sec on Ryzen 9 7950X) is the best option for CPU-only inference on 16 GB system RAM. Gemma 2 2B Q8 is the fastest at ~28 tok/sec. Llama 3.1 8B Q4_K_M (4.9 GB) also fits but runs at ~12 tok/sec -- slow for interactive use.

A Note on Third-Party Facts

This article references third-party AI models, benchmarks, prices, and licenses. The AI landscape changes rapidly. Benchmark scores, license terms, model names, and API prices can shift between the time of writing and the time you read this. Before making deployment or compliance decisions based on this article, verify current figures on each provider's official source: Hugging Face model cards for licenses and benchmarks, provider websites for API pricing, and EUR-Lex for current GDPR and EU AI Act text. This article reflects publicly available information as of May 2026.
