Key Takeaways
- VRAM math: VRAM (GB) ≈ (parameters in billions × quantization bits) ÷ 8. Example: a 70B model at 4-bit = 70 × 4 ÷ 8 = 35 GB.
- 7B models: 8 GB VRAM (RTX 4070 Ti, RTX 5080, M3 Max Mac).
- 13B models: 12–16 GB VRAM (RTX 4080, RTX 5090, M4 Max Mac).
- 70B models: 40–48 GB VRAM (RTX 6000 Ada, 2× RTX 4090, A100 80GB).
- Budget: RTX 4070 Ti is the best value (~$600, handles 7–13B models). RTX 4090 (24 GB, ~$1800) handles up to ~30B models fully in VRAM; 70B requires partial offload or multiple GPUs.
- As of April 2026, GPU prices have stabilized; CPU/RAM are less critical than GPU VRAM for LLM speed.
How Do You Calculate VRAM Requirements?
VRAM requirements depend on three factors: model size (parameters), quantization (bits per weight), and inference mode.
Formula:
```
VRAM (GB) = (Model size in billions of parameters × Quantization bits) ÷ 8
```
Quantization values: FP16 = 16 bits, Q8 = 8 bits, Q5 = 5 bits, Q4 = 4 bits.
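The formula above can be sketched as a small Python helper (the function name is illustrative, not from any library; it estimates weight storage only, excluding context and runtime overhead):

```python
def vram_gb(params_billions: float, bits: int) -> float:
    """Estimate GB of VRAM needed to hold model weights:
    (parameters in billions x bits per weight) / 8.
    Excludes KV-cache/context and runtime overhead."""
    return params_billions * bits / 8

print(vram_gb(70, 4))   # 70B at Q4  -> 35.0 GB of weights
print(vram_gb(7, 8))    # 7B at Q8   -> 7.0 GB
print(vram_gb(13, 16))  # 13B at FP16 -> 26.0 GB
```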
| Model | FP16 (best quality) | Q8 (excellent) | Q5 (good) | Q4 (good, smallest) |
|---|---|---|---|---|
| Llama 3.1 7B | ~14 GB | ~7 GB | ~4.4 GB | ~3.5 GB |
| Llama 3.1 13B | ~26 GB | ~13 GB | ~8.1 GB | ~6.5 GB |
| Llama 3.1 70B | ~140 GB | ~70 GB | ~43.8 GB | ~35 GB |
| Qwen2.5 32B | ~64 GB | ~32 GB | ~20 GB | ~16 GB |

Values follow directly from the formula above and cover weights only; add headroom for context and runtime overhead.
What GPU Should You Buy?
As of April 2026, NVIDIA dominates local LLM performance. Here are tier recommendations:
| Tier | GPU | VRAM | Best For | Performance |
|---|---|---|---|---|
| Budget ($600) | RTX 4070 Ti / RTX 5070 | 12 GB | 7–13B models | Fast (80 tokens/sec) |
| Mid ($1200) | RTX 4080 / RTX 5080 | 16 GB | 13–30B models | Very fast (120 tokens/sec) |
| High ($1800) | RTX 4090 / RTX 5090 | 24 GB | Up to ~30B models (70B needs offload) | Extremely fast (150 tokens/sec) |
| Server ($3000+) | RTX 6000 Ada / A100 | 48+ GB | Multi-user, 70B+ | Production-grade |
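The tier table above can be expressed as a small lookup sketch. This is a hypothetical helper, not a real library: cutoffs mirror the table's VRAM columns, and the VRAM estimate is the weights-only formula from earlier (budget extra for overhead in practice):

```python
def pick_tier(params_billions: float, bits: int = 4) -> str:
    """Map an estimated VRAM need (weights only) to a GPU tier."""
    need_gb = params_billions * bits / 8
    if need_gb <= 12:
        return "Budget (12 GB): RTX 4070 Ti / RTX 5070"
    if need_gb <= 16:
        return "Mid (16 GB): RTX 4080 / RTX 5080"
    if need_gb <= 24:
        return "High (24 GB): RTX 4090 / RTX 5090"
    return "Server (48+ GB): RTX 6000 Ada / A100"

print(pick_tier(13))  # 13B Q4 ~ 6.5 GB  -> Budget tier
print(pick_tier(32))  # 32B Q4 ~ 16 GB   -> Mid tier
print(pick_tier(70))  # 70B Q4 ~ 35 GB   -> Server tier
```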
What CPU and RAM Do You Need?
With a GPU, CPU and RAM are secondary. The GPU does the heavy lifting; CPU/RAM handle context preparation.
Minimum CPU: 8-core processor (Intel i7 12th gen, AMD Ryzen 5 5600X, or newer). Older CPUs add 20%+ latency.
RAM: 16 GB minimum (with GPU). If running without GPU, 32+ GB recommended. RAM does not directly limit model size when GPU is present.
Storage: 500 GB SSD for model files and OS. M.2 NVMe is preferred (faster model loading).
How Much Storage Do You Need?
Model files are large. A 7B model at 4-bit quantization is 4–5 GB. Plan accordingly:
- 500 GB SSD: OS + 1–2 small models (3B, 7B)
- 1 TB SSD: OS + 3–5 models (mix of 7B and 13B)
- 2 TB SSD: OS + 10+ models (various sizes)
- 4 TB NVMe RAID: Production setup, fast model loading
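A quantized model file on disk is roughly the same size as its weight footprint in VRAM, so the same formula doubles as a storage planner. A minimal sketch, assuming a hypothetical library of Q4 models (real GGUF files carry a little extra metadata, hence "roughly"):

```python
def model_file_gb(params_billions: float, bits: int) -> float:
    """Rough on-disk size of a quantized model file in GB."""
    return params_billions * bits / 8

# (size in billions of params, quantization bits) for models to keep locally
library = [(7, 4), (13, 4), (70, 4)]
total = sum(model_file_gb(p, b) for p, b in library)
print(f"~{total:.1f} GB for model files alone")  # ~45.0 GB
```

At ~45 GB for three Q4 models plus the OS, a 500 GB SSD already leaves comfortable room, matching the plan above.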
Budget Build Recommendations
Building a local LLM machine from scratch:
| Budget | GPU | CPU | RAM | Models | Cost |
|---|---|---|---|---|---|
| $1500 (entry) | RTX 4070 Ti | i7 13700 | 16 GB | 7–13B | Realistic |
| $2500 (solid) | RTX 4080 | i7 14700K | 32 GB | 13–30B | Recommended |
| $4000 (high-end) | 2× RTX 4090 | Ryzen 9 7950X | 128 GB | Any (70B+) | Overkill for personal |
Mac Hardware for Local LLMs
Apple Silicon (M-series) is surprisingly good for local LLMs. M3/M4 Max and Pro handle 7–13B models well.
| Mac | GPU Memory | Best For | Limitation |
|---|---|---|---|
| M3 MacBook Pro 16" | 18 GB unified | 7B models (fast) | Can run 13B slowly |
| M3 Max (Studio) | 36 GB unified | 13B models (good) | Shared CPU/GPU memory |
| M4 Max (Studio) | 36+ GB unified | 13–30B models | Slower than NVIDIA for 70B |
Server Hardware vs Consumer Hardware
For production deployment, server-grade hardware is recommended:
- Consumer (RTX 4090): ~$1800, 24 GB VRAM, single-user, prone to thermal throttling under sustained load.
- Server (RTX 6000 Ada): ~$5000, 48 GB VRAM, designed for 24/7 use, better cooling, error correction.
- Recommendation: Start with RTX 4090. If running 70B models 24/7 for multiple users, upgrade to dual A100 or RTX 6000.
Common Mistakes in Hardware Planning
- Buying CPU-only when GPU is available. A $600 RTX 4070 Ti will outperform a $2000 CPU. GPU dominates LLM speed.
- Not accounting for VRAM overhead. Model file size + system overhead + context = total VRAM used. Always buy 25% more than the model size.
- Assuming all 70B models fit in 40GB VRAM. They do, barely, in Q4 (4-bit) quantization only. Q5 requires 45+ GB.
- Ignoring power supply and cooling. An RTX 4090 draws up to 450 W (an RTX 5090 up to 575 W). Plan for a 1000–1200 W PSU and good case airflow.
- Thinking an old GPU will work. An RTX 2080 is several times slower than an RTX 4070 Ti. Modern GPU architecture matters significantly.
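The "buy 25% more VRAM than the model size" rule from the list above can be sketched as a quick fit check. The 1.25 factor is this article's heuristic for context plus system overhead, not a universal constant:

```python
def fits(params_billions: float, bits: int, gpu_vram_gb: float) -> bool:
    """True if model weights plus ~25% overhead fit in the given VRAM."""
    weights_gb = params_billions * bits / 8
    return weights_gb * 1.25 <= gpu_vram_gb

print(fits(13, 4, 12))  # 13B Q4: 6.5 * 1.25 = 8.1 GB  -> fits a 12 GB card
print(fits(70, 4, 24))  # 70B Q4: 35 * 1.25 = 43.75 GB -> no fit in 24 GB
print(fits(70, 4, 48))  # 70B Q4 fits a 48 GB card with headroom
```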
Common Questions About Local LLM Hardware
Can I run a 70B model on a laptop?
Only with heavy quantization (Q2, 2-bit) and CPU fallback. Impractical. Laptops are suited for 7B models. For 70B, use a desktop with RTX 4090+.
Is RTX 4090 overkill for personal use?
Not if you run 70B models or multiple models simultaneously. For just 7B chat, RTX 4070 Ti suffices. RTX 4090 is future-proof if you want flexibility.
Should I buy RTX 5090 or wait for RTX 6090?
RTX 5090 is available (early 2026). RTX 6000 Ada server GPUs are also solid. Unless you have an unlimited budget, either the RTX 5090 or the RTX 4090 is an excellent choice.
How does quantization affect quality?
FP16 = 100% quality (baseline), Q8 = 99%, Q5 = 95%, Q4 = 90–95%. For most tasks, Q4 is indistinguishable from FP16.
Can I upgrade GPU later?
Yes. Start with RTX 4070 Ti now, upgrade to RTX 5090 in 2 years if needed. GPU is the most replaceable component.
Sources
- NVIDIA GPU Specifications — nvidia.com/en-us/geforce/graphics-cards/
- Apple Silicon Performance — apple.com/mac/m3/
- LLM VRAM Calculator — vram.asult.com (reference)
- Model Quantization Benchmarks — huggingface.co/docs/transformers