Key Takeaways
- CPU: Threadripper 7980X (64-core, $3,000) or Intel Xeon W9 ($5,000+). Enables parallel fine-tuning while serving inference.
- GPU: 2× RTX 4090 24GB (used pair ~$2,200–2,600). 48GB total VRAM for multi-user 70B or single 70B + prep tasks.
- RAM: 128GB DDR5 ($600–800). Supports 8+ concurrent 7B users or single-user 70B + quantization in parallel.
- Storage: 4–8TB NVMe SSD + 12–24TB HDD ($800–1,500). Multi-model library + backups + training datasets.
- PSU: 2× 1200W or 1× 2000W ($800–1,200). Dual 4090s draw 900W sustained; headroom for spikes is essential.
- Cooling: Custom liquid loop or dual AIO ($1,000–2,000). Dual GPUs + CPU ≈ 1,100W sustained heat output.
- Network: 10Gbps Ethernet optional ($200–400). LAN multi-user access without bottlenecking.
- Total: $4,000–6,000. Supports 8+ concurrent 7B users (2–3 on 70B) or one user fine-tuning and serving simultaneously.
Who Needs a $4Kβ6K Workstation?
This tier is for:
- SMBs/Enterprises: Running internal LLM API for 5+ employees simultaneously. On-prem data control required.
- AI researchers: Fine-tuning large models (70B LoRA) while serving inference to team. Single $2K rig can't parallelize.
- MLOps engineers: Building internal inference clusters. Start with one workstation as the server node.
- Content studios (serious): Running 24/7 video captioning, code generation, summarization without API costs.
What's the Workstation Parts List?
| Component | Model | Price (April 2026) | Notes |
|---|---|---|---|
| CPU | AMD Threadripper 7980X (64-core) | $3,000 | Xeon W9 alternative runs $5,000+; a Ryzen 9 suffices for inference-only builds |
| GPU | 2× NVIDIA RTX 4090 24GB (used pair) | $2,200–2,600 | 48GB total VRAM |
| RAM | 128GB DDR5 | $600–800 | Multi-user serving, or 70B inference + quantization in parallel |
| SSD | 4–8TB NVMe | $800–1,500 (with HDD) | Multi-model library + training datasets |
| HDD | 12–24TB | (included above) | Backups and cold storage |
| PSU | 2× 1200W or 1× 2000W, 80+ Platinum | $800–1,200 | ~900W sustained GPU draw plus spike headroom |
| Cooling | Custom liquid loop or dual 360mm AIO | $1,000–2,000 | ~1,100W sustained heat |
| Network | 10GbE NIC (optional) | $200–400 | Multi-user LAN access |
| Total | — | $4,000–6,000 | Low end assumes a cheaper inference-only CPU (see FAQ) and used GPUs |
Dual GPU Setup: Configuration & Scaling
Two RTX 4090s give you 48GB VRAM and ~2Γ throughput for inference.
1. Side-by-side (independent GPUs): Each GPU runs independently. Model A on GPU 0, Model B on GPU 1. Best for heterogeneous workloads (fine-tuning 7B + serving 70B).
2. PCIe peer-to-peer (no VRAM fusion): The RTX 4090 dropped the NVLink connector (it last appeared on the RTX 3090), so the two cards cannot pool their VRAM into a single 48GB space. Cross-GPU traffic runs over PCIe, which is adequate for tensor-parallel inference.
3. Dual-GPU inference: Shard a single 70B model across 2 GPUs for ~2× throughput (28 tok/s instead of 14). Requires vLLM or llama.cpp tensor-parallel support.
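A back-of-the-envelope VRAM estimate shows why the second card matters for 70B models. A minimal sketch (the helper name and decimal-GB convention are my own; real deployments also need headroom for the KV cache and activations):

```python
def weights_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM (decimal GB) needed to hold just the model weights."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 70B model: 16-bit needs 140 GB, 8-bit needs 70 GB;
# only 4-bit (35 GB) fits in the 48 GB of a dual-4090 pair.
for bits in (16, 8, 4):
    print(f"70B @ {bits:>2}-bit: {weights_vram_gb(70, bits):.0f} GB")
```

This is also why a single 24GB card cannot serve a 70B model even at 4-bit quantization.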
How Do You Cool 1,100W+ of Heat?
RTX 4090 (450W) + RTX 4090 (450W) + CPU (200W) = 1,100W sustained, spikes to 1,300W.
- Custom liquid loop: $1,500–2,500. CPU water block + GPU water blocks + 360mm radiator. Keeps GPUs <75°C, CPU <80°C.
- Dual 360mm AIO: $600–900. One AIO per GPU + separate CPU cooler. More modular, easier maintenance than a custom loop.
- Air cooling: Not viable. Thermal throttling guaranteed on sustained 70B inference.
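To size room cooling around the rig, it helps to convert sustained draw into BTU/hr. The conversion factor is standard; the helper itself is a hypothetical sketch:

```python
def watts_to_btu_hr(watts: float) -> float:
    """1 W dissipated continuously is about 3.412 BTU/hr of heat."""
    return watts * 3.412

# 1,100W sustained is roughly 3,750 BTU/hr -- comparable to a small
# space heater running nonstop, so budget room AC capacity accordingly.
sustained_heat = watts_to_btu_hr(1100)
print(f"{sustained_heat:.0f} BTU/hr")
```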
Power Supply & Electrical Planning
Dual 4090s demand careful PSU selection.
- Option 1: Single 2000W PSU: Seasonic, Corsair, or EVGA 80+ Platinum. Cleaner cable routing, but a single point of failure.
- Option 2: Dual 1200W PSUs: One PSU per GPU + shared motherboard. Provides redundancy (if one fails, inference continues at 50% speed), but the setup is more complex.
- Capacity rule: 2000W for dual 4090 is minimum. Anything less causes voltage sag under load.
- Circuit planning: A dual-GPU rig pulls 2000W at peak. Ensure 20A circuit (typical home/office outlet is 15A, insufficient). Use dedicated 240V line if available.
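The outlet math above can be checked quickly. A sketch assuming nominal North American voltages and the common 80% continuous-load rule for breakers (the function names are mine):

```python
def circuit_amps(watts: float, volts: float) -> float:
    """Current drawn by a resistive-equivalent load."""
    return watts / volts

def fits_breaker(watts: float, volts: float, breaker_amps: float) -> bool:
    """Continuous loads should stay under 80% of the breaker rating."""
    return circuit_amps(watts, volts) <= 0.8 * breaker_amps

# 2,000W peak on a 15A/120V circuit: ~16.7A -- over the breaker itself.
print(fits_breaker(2000, 120, 15))   # False
# 1,100W sustained on the same circuit: ~9.2A -- within the 80% rule.
print(fits_breaker(1100, 120, 15))   # True
# 2,000W on a 240V line: ~8.3A -- comfortable headroom.
print(fits_breaker(2000, 240, 15))   # True
```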
Multi-User Inference Performance
With 128GB RAM and dual 4090s:
- Single user, 70B model: 28 tokens/sec (2× 14 tok/s per GPU via tensor parallelism).
- Two concurrent users, 70B each: 14 tokens/sec per user (time-multiplexing requests).
- Four concurrent users, 7B each: 120 tokens/sec total (each user gets 30 tok/s).
- Fine-tuning 7B LoRA + serving 70B: Fine-tuning on GPU 0 (100W), inference on GPU 1 (450W). No interference.
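The per-user figures above follow a simple even-sharing model of aggregate throughput. A sketch of that idealization (it ignores batching gains and scheduler overhead; the helper is hypothetical):

```python
def per_user_tok_s(aggregate_tok_s: float, users: int) -> float:
    """Time-multiplexed serving: each user gets an equal share of throughput."""
    return aggregate_tok_s / users

# 70B sharded across both GPUs yields 28 tok/s aggregate:
print(per_user_tok_s(28, 1))   # 28.0 for a single user
print(per_user_tok_s(28, 2))   # 14.0 each for two users
# Four 7B users at 120 tok/s aggregate:
print(per_user_tok_s(120, 4))  # 30.0 each
```

In practice, continuous-batching servers such as vLLM can do better than strict time-slicing, so treat these as floor estimates.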
Common Workstation Build Mistakes
- Buying two different GPU models (5090 + 4090). Asymmetry causes load balancing issues. Stick to identical cards.
- Skimping on PSU to save $300. A 1500W PSU + dual 4090s will throttle or crash under load.
- Using air cooling instead of liquid. Thermal throttling cuts throughput 30β50% on sustained inference.
FAQ
Is a Threadripper CPU necessary, or can I use Ryzen 9?
For inference alone: Ryzen 9 works fine. For inference + parallel fine-tuning: Threadripper's extra cores (64 vs. 16) are essential.
Should I use NVLink to fuse the two 4090s?
You can't: the RTX 4090 has no NVLink connector (the RTX 3090 was the last consumer card with one). Run separate models on each GPU (7B + 70B), or shard a single 70B across both cards over PCIe with tensor parallelism for higher throughput.
How many concurrent users can a dual-4090 rig handle?
For 70B: 2–3 users (each getting 14 tok/s). For 7B: 8+ users (each getting 30+ tok/s).
Can I upgrade to RTX 5090 instead of dual 4090?
Single 5090: similar aggregate throughput to a dual-4090 pair but less VRAM (32GB vs. 48GB), $1,999. Dual 5090s: ~$4,000 (overkill for this tier, worse value).
What's the ROI on a $5,000 workstation vs. cloud LLM API?
Cloud: $0.001 per 1K tokens ($1 per million). Workstation: $5,000 amortized over 2 years = $2,500/year. Break-even at 2.5B tokens/year (~7M tokens/day); beyond that, every additional token costs only electricity. That volume is realistic for a multi-user rig serving around the clock.
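A quick sanity check on the break-even arithmetic, using the prices from the answer above (the function is a hypothetical sketch that ignores electricity and maintenance):

```python
def break_even_tokens(annual_hw_cost: float, cloud_usd_per_1k: float) -> float:
    """Tokens/year at which cloud spend equals the amortized hardware cost."""
    return annual_hw_cost / cloud_usd_per_1k * 1000

# $5,000 over 2 years = $2,500/year vs. $0.001 per 1K cloud tokens:
tokens = break_even_tokens(2500, 0.001)
print(f"{tokens:.2e}")  # ~2.5 billion tokens/year, roughly 7M/day
```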
Sources
- PCPartPicker: High-end workstation component pricing (April 2026)
- TechPowerUp: Threadripper & Xeon W power consumption & specifications
- NVIDIA NVLink documentation: GPU memory fusion and tensor-parallel inference