Key Takeaways
- Multi-GPU: split one large model across 2+ GPUs. Example: a 4-bit-quantized 70B model split evenly across 2× RTX 4090 (48 GB total VRAM).
- Speed penalty: ~5–10% slower than single GPU (GPU-to-GPU communication overhead).
- Best for: 70B models, high-concurrency services (50+ simultaneous users).
- Automatic: Modern tools (vLLM, Ollama, llama.cpp) auto-detect multiple GPUs.
- As of April 2026, this is standard for production deployments.
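The 48 GB figure from the first bullet only works out with quantization. A rough sizing sketch (the 1.2× overhead factor for KV cache and activations is an assumption, not a fixed rule):

```python
def model_vram_gb(params_billion: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    """Rough VRAM need: weight bytes plus ~20% for KV cache and activations."""
    weight_gb = params_billion * bits_per_weight / 8  # 1e9 params * (bits/8) bytes / 1e9
    return weight_gb * overhead

TOTAL_VRAM_GB = 48  # 2x RTX 4090

for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    need = model_vram_gb(70, bits)
    print(f"70B {name}: ~{need:.0f} GB -> {'fits' if need <= TOTAL_VRAM_GB else 'does not fit'}")
# Only the 4-bit quant (~42 GB) fits in 48 GB; FP16 (~168 GB) and Q8 (~84 GB) do not.
```

This is why the performance table below lists 70B Q4/Q5 for the dual-4090 setup rather than full precision.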
How Layer Splitting and Tensor Parallelism Work
A 70B Transformer model has 80 layers. With layer splitting, Ollama might place:
- GPU 1: Layers 1–40
- GPU 2: Layers 41–80
When a token is generated, its activations flow through GPU 1's layers, then GPU 2's, and the process starts again at GPU 1 for the next token. Only a small activation tensor crosses the GPU boundary, so communication overhead is minimal.
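The split above can be sketched with a toy model; the even partition and the single GPU-to-GPU hand-off per token are the point, not the arithmetic:

```python
# Toy sketch of layer splitting (pipeline-style): 80 layers across 2 "GPUs".
NUM_LAYERS, NUM_GPUS = 80, 2

def split_layers(num_layers: int, num_gpus: int) -> list[list[int]]:
    """Assign contiguous 1-based layer ranges to GPUs, as evenly as possible."""
    per_gpu, rem = divmod(num_layers, num_gpus)
    assignment, start = [], 1
    for gpu in range(num_gpus):
        count = per_gpu + (1 if gpu < rem else 0)
        assignment.append(list(range(start, start + count)))
        start += count
    return assignment

def transfers_per_token(assignment: list[list[int]]) -> int:
    """Activations cross to the next GPU once per boundary between shards."""
    return len(assignment) - 1

layout = split_layers(NUM_LAYERS, NUM_GPUS)
print(layout[0][0], layout[0][-1])    # GPU 1 holds layers 1-40
print(layout[1][0], layout[1][-1])    # GPU 2 holds layers 41-80
print(transfers_per_token(layout))    # one GPU-to-GPU hop per generated token
```

With only one small tensor crossing the link per token, the link speed barely matters for this strategy; tensor parallelism (below) communicates far more.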
Multi-GPU Setup With vLLM
vLLM supports tensor parallelism out-of-the-box:
```bash
# Run a 70B model across 2 GPUs
vllm serve meta-llama/Llama-3.1-70B \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --port 8000

# API is now at http://localhost:8000/v1
# Same API, automatic multi-GPU handling
```

Multi-GPU Setup With Ollama
Ollama auto-detects multiple GPUs and splits automatically:
1. Run Ollama normally: `ollama serve`
2. Ollama detects 2+ GPUs and automatically splits models
3. No configuration needed — it just works.
Verify with `nvidia-smi` or `rocm-smi` to see both GPUs loading.
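Once the vLLM server from the earlier command is up, any OpenAI-compatible client can talk to it; multi-GPU is invisible to the caller. A minimal stdlib-only sketch (the model name and port assume that exact command):

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "meta-llama/Llama-3.1-70B") -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """POST the request and return the first completion's text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# chat("Why is the sky blue?")  # requires the server to be running
```

Nothing in the client mentions GPUs; the `--tensor-parallel-size 2` flag on the server side is the only place the topology appears.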
Performance With 2 GPUs
| Setup | Model | Speed | Cost |
|---|---|---|---|
| 1× RTX 4090 (24GB) | 7B | 150 tok/sec | $1800 |
| 1× RTX 4090 (24GB) | 70B | Impossible | $1800 |
| 2× RTX 4090 (48GB) | 70B Q4 | 100 tok/sec | $3600 |
| 2× RTX 4090 (48GB) | 70B Q5 | 90 tok/sec | $3600 |
| RTX 6000 Ada (48GB) + RTX 4090 (24GB) | 70B Q6 | 110 tok/sec | $6800 |
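One way to read the table: throughput per dollar drops with the second GPU, but capability jumps. A quick calculation from the rows above:

```python
# Tokens/sec per $1000 of hardware, from the table above.
setups = [
    ("1x RTX 4090, 7B",     150, 1800),
    ("2x RTX 4090, 70B Q4", 100, 3600),
    ("2x RTX 4090, 70B Q5",  90, 3600),
]
efficiency = {name: round(speed / cost * 1000, 1) for name, speed, cost in setups}
print(efficiency)
# The single card wins on raw efficiency (~83 vs ~28 tok/sec per $1000),
# but it cannot load a 70B model at all -- the second GPU buys capability.
```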
When to Use Multi-GPU
Multi-GPU is justified when:
- You need to run 70B+ models.
- You serve 50+ concurrent users (batch processing).
- You want to run multiple 13B models simultaneously.
- You run production services (not experimentation).
Common Multi-GPU Mistakes
- Expecting 2× speedup with 2 GPUs. For a single generation stream you get roughly 90–95% of single-GPU speed (5–10% overhead from GPU-to-GPU communication); the win is capacity, not latency.
- Assuming GPUs must be identical. You can mix an RTX 4090 and an RTX 4080, but vLLM's tensor parallelism places an equal shard on each card, so throughput is paced by the slower GPU and usable memory by the smaller one.
- Not using NVLink for communication. Without NVLink, multi-GPU communication is slower. NVLink is rare on consumer GPUs.
- Forgetting about PCIe bandwidth. Without NVLink, GPU-to-GPU communication goes through PCIe, which caps it at roughly 32 GB/sec on a PCIe 4.0 x16 link.
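For layer splitting specifically, the PCIe concern is small: only one activation vector crosses the link per token per split point. A back-of-envelope check (hidden size 8192 is Llama-70B's; ~32 GB/s is a usable-bandwidth estimate for PCIe 4.0 x16):

```python
# Per-token traffic across the GPU boundary with layer splitting.
HIDDEN_SIZE = 8192               # Llama-70B hidden dimension
BYTES_PER_VALUE = 2              # FP16 activations
PCIE4_X16_BYTES_PER_SEC = 32e9   # ~32 GB/s usable on PCIe 4.0 x16

activation_bytes = HIDDEN_SIZE * BYTES_PER_VALUE  # 16 KiB per token per hop
transfer_us = activation_bytes / PCIE4_X16_BYTES_PER_SEC * 1e6
print(f"{activation_bytes} bytes -> {transfer_us:.2f} us per token")
# Tensor parallelism, by contrast, synchronizes activations at every layer,
# which is why that mode benefits far more from NVLink.
```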
Sources
- vLLM Tensor Parallelism — docs.vllm.ai/en/serving/distributed_serving.html
- Ollama Multi-GPU — github.com/ollama/ollama/blob/main/docs/gpu.md