Key Takeaways
- RTX 4090 is the best single consumer GPU for local AI in 2026: 24 GB VRAM, ~1 TB/s bandwidth
- 70B Q4 models need 40+ GB VRAM — requires dual RTX 3090 or CPU offloading
- Ryzen 9 9950X (Zen 5, 16 cores) is the best CPU for fast CPU offloading of large layers
- DDR5-6000 at 64 GB minimum; 128 GB enables 70B CPU offloading at useful speeds
- PCIe Gen 4/5 NVMe loads a 7B model in under 2 seconds vs 10+ seconds on SATA
- Tier 1 and Tier 2 share the AM5 socket, so you can upgrade GPU or RAM later without a new motherboard; Tier 3 uses TRX50
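The 24 GB and 40+ GB figures above follow from a back-of-the-envelope size estimate: parameter count times bits per weight, plus an allowance for KV cache and runtime buffers. The sketch below is a rough approximation, not a measurement. The bits-per-weight values (about 4.5 for Q4, 8.5 for Q8) and the flat 10% overhead are assumptions chosen for illustration, and the function names are hypothetical.

```python
# Rough VRAM estimate for a quantized model.
# Assumptions (not from the article): ~4.5 bits/weight for Q4, ~8.5 for Q8,
# and a flat 10% allowance for KV cache and runtime buffers.
BITS_PER_WEIGHT = {"Q4": 4.5, "Q8": 8.5}


def estimate_vram_gb(params_billion: float, quant: str, overhead: float = 0.10) -> float:
    """Return an approximate VRAM requirement in GB (decimal)."""
    weights_gb = params_billion * BITS_PER_WEIGHT[quant] / 8
    return weights_gb * (1 + overhead)


def fits(params_billion: float, quant: str, vram_gb: float) -> bool:
    """True if the estimate is within the given VRAM budget."""
    return estimate_vram_gb(params_billion, quant) <= vram_gb


if __name__ == "__main__":
    for size, quant in [(8, "Q8"), (14, "Q8"), (32, "Q4"), (70, "Q4")]:
        need = estimate_vram_gb(size, quant)
        print(f"{size}B {quant}: ~{need:.1f} GB, "
              f"fits 24 GB: {fits(size, quant, 24)}, fits 48 GB: {fits(size, quant, 48)}")
```

Long context windows grow the KV cache well beyond a flat 10%, so treat the result as a lower bound when planning for large contexts.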
Tier 1: $1200 Budget AI Workstation
The $1200 budget build uses a used RTX 3090 (24 GB VRAM) as the core. It runs Llama 3.1 8B Q8 at 45–60 tok/s, Qwen2.5 14B Q8 at 20–28 tok/s, and Qwen2.5 32B Q4 at 12–18 tok/s entirely on GPU. The RTX 3090 draws 350 W — pair with a quality 850 W PSU.
- Models supported at full GPU speed: 7B (any quant), 13B (Q4/Q8), 14B (Q4/Q8), 30B (Q4)
- 70B support: CPU offloading required — ~5–8 tok/s, functional but not ideal
- Power draw: ~450 W peak (GPU 350 W + CPU 65 W + rest)
- Recommended PSU: Corsair RM850x or equivalent 80+ Gold
Tier 2: $2500 Recommended AI Workstation
The $2500 recommended build centers on the RTX 4090 (24 GB, ~1 TB/s memory bandwidth) paired with the AMD Ryzen 9 9950X (Zen 5, 16 cores). The 4090 is 30–40% faster than the 3090 with the same 24 GB of VRAM and draws less power per token. This build handles 30B Q4 models fully on GPU and 70B models via CPU offloading at 10–15 tok/s with 64 GB RAM.
- Models supported at full GPU speed: 7B (any quant), 13B–14B (Q4/Q8), 30B–32B (Q4 fits in 24 GB)
- 70B support: CPU offloading at 10–15 tok/s with 64 GB RAM; upgrade to 128 GB for 15–20 tok/s
- 7B Q4 speed: ~105–125 tok/s on Ollama
- 14B Q8 speed: ~48–60 tok/s
- 30B Q4 speed: ~28–38 tok/s
- Power draw: ~550 W peak (GPU 450 W + CPU 65 W + rest)
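Token generation on a single GPU is usually limited by memory bandwidth, because each new token requires reading roughly the whole set of weights once. The sketch below applies that rule of thumb to the ~1 TB/s figure above. The weight sizes assume about 4.5 bits per weight for Q4, which is an assumption, and the result is a theoretical ceiling: measured speeds such as those listed above land well below it.

```python
# Upper bound on decode speed: memory bandwidth / bytes read per token.
# Assumption: each generated token reads all model weights once; real
# throughput is lower because of KV-cache reads and kernel overheads.

def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Theoretical ceiling in tokens/s for a bandwidth-bound decoder."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    rtx_4090_bw = 1000.0  # GB/s, the article's ~1 TB/s figure
    # Hypothetical weight sizes at ~4.5 bits/weight (Q4)
    for name, size_gb in [("7B Q4", 7 * 4.5 / 8), ("30B Q4", 30 * 4.5 / 8)]:
        ceiling = max_tokens_per_second(rtx_4090_bw, size_gb)
        print(f"{name}: ceiling ~{ceiling:.0f} tok/s")
```

The same formula explains why CPU offloading is slow: layers held in system RAM are read at DDR5 bandwidth, which is roughly an order of magnitude lower than GPU memory bandwidth.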
Tier 3: $5000 Professional 70B Workstation
The $5000 professional build targets 70B model inference at GPU speed (25–40 tok/s) using dual RTX 3090 GPUs for 48 GB total VRAM. The Ryzen Threadripper 7960X (24 cores, high memory bandwidth) accelerates CPU offloading for models that spill over 48 GB. With 256 GB DDR5, even 140B quantized models load entirely in RAM.
- Models supported at full GPU speed (48 GB total VRAM): 7B–70B Q4, 30B Q8
- 70B Q4 speed: 25–40 tok/s (both RTX 3090s active, with Ollama splitting the model's layers across them)
- CPU offloading with 256 GB RAM: runs 140B+ models at 4–6 tok/s
- Dual GPU configuration: Ollama detects both GPUs automatically; no NVLink needed
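- Power draw: ~900 W typical peak under inference load; component maximums sum higher (2× GPU 700 W + CPU 350 W = 1050 W before the rest), so size the PSU for the worst case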
- Recommended PSU: Seasonic PRIME TX-1600W or equivalent
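Summing the component figures is a quick sanity check on PSU sizing. The sketch below adds up the per-component draws quoted for Tier 1 and Tier 3 and reports how much of the recommended PSU's rating that would consume. The 50 W allowance for motherboard, drives, and fans is an assumption, and the function name is hypothetical.

```python
# Sum component power figures and compare against a PSU rating.
# Assumption: 50 W for motherboard, drives, and fans (the "rest").
REST_W = 50.0


def psu_load(components_w: list[float], psu_w: float, rest_w: float = REST_W) -> tuple[float, float]:
    """Return (total watts, fraction of the PSU rating used)."""
    total = sum(components_w) + rest_w
    return total, total / psu_w


if __name__ == "__main__":
    builds = {
        "Tier 1 (RTX 3090, 850 W PSU)": ([350, 65], 850),
        "Tier 3 (2x RTX 3090 + Threadripper, 1600 W PSU)": ([350, 350, 350], 1600),
    }
    for name, (parts, psu) in builds.items():
        total, frac = psu_load(parts, psu)
        print(f"{name}: ~{total:.0f} W worst case, {frac:.0%} of PSU rating")
```

Summed maximums are a worst case: during inference the CPU and both GPUs rarely hit their limits at the same moment, so typical draw sits below this total.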
Software Stack for Any Build
Once hardware is assembled, getting Ollama running takes under 10 minutes:
1. Install Ubuntu 22.04 LTS or Windows 11 (Ubuntu preferred for CUDA stability)
2. Install NVIDIA drivers 550+ from nvidia.com or:
   ubuntu-drivers autoinstall
3. Install Ollama:
   curl -fsSL https://ollama.com/install.sh | sh
4. Pull a model:
   ollama pull qwen2.5:14b-instruct-q8_0
5. Run as network server:
   OLLAMA_HOST=0.0.0.0 ollama serve
6. Install Open WebUI for browser UI:
   docker run -d -p 3000:8080 --gpus all ghcr.io/open-webui/open-webui:cuda
7. Expose via Tailscale for secure remote access from any device
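Once the server from the steps above is running, you can check the tok/s figures quoted in this guide on your own hardware. The sketch below sends one non-streaming request to Ollama's `/api/generate` endpoint and derives throughput from the `eval_count` and `eval_duration` (nanoseconds) fields in the response. The host, prompt, and helper names are assumptions for illustration.

```python
# Measure generation speed via Ollama's HTTP API (default port 11434).
import json
import urllib.request


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert Ollama's eval_count / eval_duration (nanoseconds) to tok/s."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration_ns must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def benchmark(model: str, prompt: str, host: str = "http://localhost:11434") -> float:
    """Send one non-streaming generate request and return tok/s."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    # Large models can take a while to load on the first request.
    with urllib.request.urlopen(req, timeout=600) as resp:
        body = json.load(resp)
    return tokens_per_second(body["eval_count"], body["eval_duration"])
```

For example, `benchmark("qwen2.5:14b-instruct-q8_0", "Explain PCIe lanes in two sentences.")` returns the generation speed for the model pulled in the steps above. Run it twice and keep the second result, since the first call includes model load time in its wall-clock wait.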
Performance Comparison Across All Three Builds
Should I build a workstation or rent cloud GPUs for running 70B models?
For regular use (2+ hours/day), build the workstation. A dedicated A40 48 GB on RunPod costs $0.44/hr; at 4 hours/day, that's about $642/year. At that rate the $5000 professional build takes roughly 8 years to pay for itself vs cloud, or about 4 years at 8 hours/day, so the case for building gets stronger the more hours you run. For occasional use (under 1 hour/day), cloud is cheaper. See our cost calculator at /local-llms/local-llm-cost-calculator-build-vs-rent-2026.
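The break-even arithmetic is simple enough to rerun with your own numbers. A minimal sketch, assuming the $0.44/hr rate quoted above and ignoring electricity, resale value, and future price changes; the build prices in the loop are example inputs.

```python
# Break-even time for buying hardware versus renting a cloud GPU.
# Assumptions: electricity, resale value, and price changes are ignored.

def yearly_cloud_cost(rate_per_hour: float, hours_per_day: float) -> float:
    """Annual rental cost at a fixed hourly rate and daily usage."""
    return rate_per_hour * hours_per_day * 365


def break_even_years(build_cost: float, rate_per_hour: float, hours_per_day: float) -> float:
    """Years of rental spending needed to equal the build cost."""
    yearly = yearly_cloud_cost(rate_per_hour, hours_per_day)
    if yearly <= 0:
        raise ValueError("rate and hours must be positive")
    return build_cost / yearly


if __name__ == "__main__":
    rate = 0.44  # $/hr, the A40 figure quoted above
    for build_cost in (3500, 5000):
        for hours in (1, 4, 8):
            years = break_even_years(build_cost, rate, hours)
            print(f"${build_cost} build, {hours} h/day: break-even in {years:.1f} years")
```

Adding your local electricity price to the build side lengthens the break-even time, so the result here is an optimistic case for building.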
Do I need NVLink to run Ollama across two GPUs?
No. Ollama splits a model's layers across the available GPUs and passes activations between them over PCIe, so no NVLink is required. On the RTX 3090, an NVLink bridge would raise inter-GPU bandwidth from ~32 GB/s (PCIe 4.0 x16) to roughly 112 GB/s, which matters for training but minimally for inference. The dual RTX 3090 setup works fully without NVLink.
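One way to see why the inter-GPU link matters little for inference: when a model is divided between two cards, only the activations at the split point cross the bus for each generated token. The sketch below estimates that transfer time over the ~32 GB/s PCIe figure. The hidden size of 8192 (a 70B-class value), 16-bit activations, and a single split point are assumptions for illustration, and link latency is ignored.

```python
# Estimate per-token transfer time between two GPUs during decoding.
# Assumptions: one split point, hidden size 8192, 2-byte activations,
# and no link latency.

def transfer_time_us(hidden_size: int, bytes_per_value: int, link_gb_s: float) -> float:
    """Microseconds to move one token's activations across the link."""
    payload_bytes = hidden_size * bytes_per_value
    return payload_bytes / (link_gb_s * 1e9) * 1e6


if __name__ == "__main__":
    hidden, fp16_bytes = 8192, 2
    token_budget_us = 1e6 / 30  # time available per token at 30 tok/s
    t = transfer_time_us(hidden, fp16_bytes, 32.0)
    share = t / token_budget_us
    print(f"PCIe 4.0 x16: {t:.3f} us per token, {share:.4%} of the per-token budget")
```

Prompt processing moves activations for every prompt token at once, so the transfer is larger there, but it remains small next to the compute time.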
Why not an RTX 4090 over dual RTX 3090 for the professional build?
VRAM is the deciding factor. Two RTX 3090s at 24 GB each = 48 GB total, enough for Llama 3.1 70B Q4 (~40 GB). A single RTX 4090 has only 24 GB — 70B Q4 does not fit without CPU offloading. For 70B inference at GPU speed, dual 3090s win on VRAM/dollar. For 30B and below, the RTX 4090 is faster per dollar.
Can I start with the budget build and upgrade to the recommended tier?
Yes. Tier 1 and Tier 2 both use the AM5 socket, so you can replace the RTX 3090 with an RTX 4090 later, or add a second GPU, and the RAM modules carry over. The only incompatibility is between Tier 1/2 (AM5) and Tier 3 (TRX50): moving to Threadripper requires a new motherboard and CPU, and TRX50 boards generally take registered DDR5, so plan on new memory as well.
What power outlet do I need for the professional build?
The professional build (dual RTX 3090 + Threadripper) peaks at ~900 W from the wall. A standard 15A/120V US outlet supports ~1800 W (about 1440 W for continuous loads), so a single outlet is sufficient. European 16A/230V outlets support ~3680 W. Use a quality PSU (Seasonic, Corsair, be quiet!) with 80+ Platinum efficiency to minimize heat and power draw.
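The outlet figures above are volts times amps. A minimal sketch of that check, assuming the common practice of derating a circuit to 80% for continuous loads; that factor is an assumption, so check your local electrical code, and remember that other devices on the same circuit share the budget.

```python
# Outlet capacity check: watts = volts * amps.
# Assumption: continuous loads are limited to 80% of the circuit rating.

def circuit_capacity_w(volts: float, amps: float, continuous_factor: float = 0.8) -> tuple[float, float]:
    """Return (peak watts, continuous watts) for a circuit."""
    peak = volts * amps
    return peak, peak * continuous_factor


def outlet_ok(load_w: float, volts: float, amps: float) -> bool:
    """True if the load fits within the continuous rating."""
    _, continuous = circuit_capacity_w(volts, amps)
    return load_w <= continuous


if __name__ == "__main__":
    for name, volts, amps in [("US 15A/120V", 120, 15), ("EU 16A/230V", 230, 16)]:
        peak, continuous = circuit_capacity_w(volts, amps)
        print(f"{name}: {peak:.0f} W peak, {continuous:.0f} W continuous, "
              f"900 W build OK: {outlet_ok(900, volts, amps)}")
```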