Hardware & Performance

Multi-GPU Local LLMs 2026: Run 70B Models Across 2+ GPUs with vLLM and Ollama

11 min read · By Hans Kuepper, Founder of PromptQuorum (multi-model AI dispatch tool)

Dual RTX 4090s (48 GB combined) run Llama 3.3 70B at ~100 tok/sec, only 5–10% slower than a theoretical single 48 GB GPU. This is the most cost-effective multi-GPU setup for 70B models in 2026.

Using multiple GPUs lets you run 70B+ models that don't fit in a single GPU's VRAM. Dual RTX 4090s (48 GB combined) run Llama 3.3 70B at Q4 at ~100 tok/sec, only 5–10% slower than a theoretical single 48 GB GPU because of inter-GPU communication overhead. As of April 2026, vLLM (tensor parallelism) and Ollama (automatic layer splitting) both support multi-GPU out of the box. NVLink reduces overhead to 3–5% but is unavailable on current consumer RTX cards; PCIe 4.0/5.0 is sufficient for most dual-GPU setups.

Key Takeaways

  • Multi-GPU: Split a large model across 2+ GPUs. Example: a 70B model split evenly across 2× RTX 4090 = 48 GB total VRAM.
  • Speed penalty: ~5–10% slower than a hypothetical single GPU with the same total VRAM (GPU-to-GPU communication overhead).
  • Best for: 70B models, high-concurrency services (50+ simultaneous users).
  • Automatic: Ollama and llama.cpp split models across detected GPUs automatically; vLLM needs only the `--tensor-parallel-size` flag.
  • As of April 2026, this is standard for production deployments.

How Do Layer Splitting and Tensor Parallelism Work?

A 70B Transformer model has 80 layers. With layer splitting, Ollama might place:

- GPU 1: Layers 1-40

- GPU 2: Layers 41-80

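When a token is generated, its activations flow through GPU 1, then GPU 2, and the result returns to GPU 1 for the next token. Only one small hand-off crosses the GPU boundary per token, so communication overhead is minimal.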
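
As a minimal sketch of the even split described above, the following bash function prints which contiguous block of layers each GPU would hold. The name `split_layers` is hypothetical, and the even split is an assumption: real runtimes may weight the split by each card's free VRAM.

```bash
# Hypothetical helper: print the layer range per GPU under an even split.
# Usage: split_layers <total_layers> <num_gpus>
split_layers() {
  local total_layers=$1 num_gpus=$2
  local base=$(( total_layers / num_gpus ))
  local extra=$(( total_layers % num_gpus ))
  local start=1 gpu count end
  for (( gpu = 1; gpu <= num_gpus; gpu++ )); do
    count=$base
    # Hand any leftover layers to the first GPUs, one each
    if (( gpu <= extra )); then
      count=$(( base + 1 ))
    fi
    end=$(( start + count - 1 ))
    echo "GPU ${gpu}: layers ${start}-${end}"
    start=$(( end + 1 ))
  done
}

split_layers 80 2
# GPU 1: layers 1-40
# GPU 2: layers 41-80
```

With three GPUs the same function gives 27, 27, and 26 layers, which shows why uneven layer counts are harmless for layer splitting.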

Layer splitting across 2 GPUs: 80-layer 70B model distributed (layers 1–40 on GPU 1, layers 41–80 on GPU 2), with PCIe inter-GPU communication adding ~10% overhead (~100 tok/sec on dual RTX 4090).

💡 Pro Tip: The data handed between layers is lightweight; what matters is GPU-to-GPU communication speed. With layers 1–40 on GPU 1 and layers 41–80 on GPU 2, there is one GPU-to-GPU transfer per token. This is why the interconnect (NVLink or PCIe) matters.
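
To put a rough number on that per-token transfer, here is a back-of-the-envelope sketch. It assumes a hidden size of 8192 (the Llama 70B family), 2 bytes per value (FP16), and ~32 GB/sec of one-way PCIe 4.0 x16 bandwidth. The function name `transfer_us` is hypothetical, and the estimate covers bandwidth only, not link latency, which in practice is likely the larger share.

```bash
# Hypothetical helper: microseconds to move one token's hidden state.
# Usage: transfer_us <hidden_size> <bytes_per_value> <link_GB_per_sec>
transfer_us() {
  awk -v h="$1" -v bytes="$2" -v gbps="$3" \
    'BEGIN { printf "%.3f\n", h * bytes / (gbps * 1e9) * 1e6 }'
}

transfer_us 8192 2 32   # 16 KiB over ~32 GB/sec -> 0.512 microseconds
```

Under these assumptions the hand-off is a fraction of a microsecond per token, which is consistent with layer splitting tolerating PCIe well.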

Multi-GPU Setup With vLLM

vLLM supports tensor parallelism out-of-the-box with a single command. Use the `--tensor-parallel-size` flag to specify the number of GPUs:

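```bash
# Run a 70B model across 2 GPUs with tensor parallelism.
# Note: unquantized 16-bit 70B weights need ~140 GB; on 2x 24 GB cards,
# substitute a 4-bit quantized checkpoint for the model ID below.
vllm serve meta-llama/Llama-3.1-70B \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --port 8000

# The OpenAI-compatible API is now at http://localhost:8000/v1
# Same API as single-GPU; vLLM handles the multi-GPU split
```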
vLLM 4-step multi-GPU setup: verify both GPUs visible (nvidia-smi), install vLLM, launch with --tensor-parallel-size 2 flag, verify both GPUs loaded and achieving ~100 tok/sec throughput.
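
The first step, confirming that both GPUs are visible, can be scripted before launching. This is a sketch only: `count_gpus` and `can_use_tp` are hypothetical helper names, and it assumes `nvidia-smi -L` prints one line starting with `GPU ` per device.

```bash
# Count visible NVIDIA GPUs; prints 0 if nvidia-smi is missing or fails.
count_gpus() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true
  else
    echo 0
  fi
}

# Succeeds if the requested tensor-parallel size fits the visible GPU count.
# Usage: can_use_tp <tp_size> <visible_gpus>
can_use_tp() {
  local tp_size=$1 visible=$2
  (( tp_size >= 1 && tp_size <= visible ))
}

visible=$(count_gpus)
if can_use_tp 2 "$visible"; then
  echo "OK: ${visible} GPUs visible, --tensor-parallel-size 2 is possible"
else
  echo "Only ${visible} GPU(s) visible; --tensor-parallel-size 2 will not start"
fi
```

Once the server is running, `curl http://localhost:8000/v1/models` should list the served model, assuming the port used in the command above.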

⚠️ Warning: Both GPUs should have the same VRAM. If you pair an RTX 4090 (24 GB) with an RTX 4080 (16 GB), vLLM is bottlenecked to 16 GB per GPU because tensor parallelism shards weights evenly. Use matching GPUs for optimal performance.
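
The arithmetic behind that bottleneck is simple enough to sketch. The helper below (`usable_vram_gb`, a hypothetical name) assumes weights are sharded evenly, so every card is limited to the capacity of the smallest one.

```bash
# Usable VRAM in GB under even sharding: smallest card times number of cards.
# Usage: usable_vram_gb <gpu1_gb> <gpu2_gb> [...]
usable_vram_gb() {
  local min=$1 count=0 v
  for v in "$@"; do
    if (( v < min )); then
      min=$v
    fi
    count=$(( count + 1 ))
  done
  echo $(( min * count ))
}

usable_vram_gb 24 16   # RTX 4090 + RTX 4080 -> 32 (not 40)
usable_vram_gb 24 24   # 2x RTX 4090         -> 48
```

The mixed pair has 40 GB of VRAM installed but only 32 GB usable under this assumption, which is below the ~35 GB that 70B at Q4 needs.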

Multi-GPU Setup With Ollama

Ollama auto-detects multiple GPUs and splits automatically:

1. Run Ollama normally: `ollama serve`

2. Ollama detects 2+ GPUs and automatically splits models

3. No configuration needed -- it just works.

Verify with `nvidia-smi` or `rocm-smi` to see both GPUs loading.

🛠️ Best Practice: Verify the multi-GPU setup by running `nvidia-smi` and checking memory usage on both GPUs. If only one GPU is loaded, either the model fits on a single GPU (Ollama then keeps it on one card) or Ollama has not detected the second GPU. In the latter case, check driver versions and upgrade if needed.
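
That check can be automated with a small filter over `nvidia-smi` output. This is a sketch: `count_loaded_gpus` is a hypothetical name, the 1024 MiB threshold is an arbitrary assumption for "a model is loaded", and the sample numbers are made up so the snippet runs without a GPU.

```bash
# Count GPUs whose used memory (MiB) exceeds a threshold.
# Reads lines of "index, used_mib" on stdin, the shape printed by:
#   nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits
# Usage: ... | count_loaded_gpus [threshold_mib]
count_loaded_gpus() {
  local threshold_mib=${1:-1024}
  awk -F', *' -v t="$threshold_mib" '$2 + 0 > t { n++ } END { print n + 0 }'
}

# Made-up sample: both cards hold roughly 20 GiB of model weights
printf '0, 21034\n1, 20987\n' | count_loaded_gpus 1024   # prints 2
```

On a live system, pipe the `nvidia-smi` query from the comment into the function while a model is loaded. `ollama ps` can also help, since it reports how a loaded model is placed between CPU and GPU.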

Performance With 2 GPUs

| Setup | Model | Speed | Cost |
| --- | --- | --- | --- |
| 1× RTX 4090 (24 GB) | 7B | 150 tok/sec | $1,800 |
| 1× RTX 4090 (24 GB) | 70B | Cannot fit | $1,800 |
| 2× RTX 4090 (48 GB) | 70B Q4 | 100 tok/sec | $3,600 |
| 2× RTX 4090 (48 GB) | 70B Q5 | 90 tok/sec | $3,600 |
| 1× RTX 5090 (32 GB) | 70B Q4 | 40–50 tok/sec | $2,000 |
| 2× RTX 5090 (64 GB) | 70B Q8 | 120 tok/sec | $4,000 |
| 2× RTX 5090 (64 GB) | 405B Q4 | 25–35 tok/sec | $4,000 |
| RTX 6000 Ada + RTX 4090 | 70B FP16 | 110 tok/sec | $6,800 |
8-row GPU performance comparison for 70B models: single RTX 4090 cannot fit 70B, dual RTX 4090 delivers 100 tok/sec ($3,600), RTX 5090 32GB runs 70B Q4 at 40–50 tok/sec ($2,000), dual RTX 5090 handles 405B Q4 at 25–35 tok/sec ($4,000).
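
One way to compare the rows is hardware cost per unit of throughput. The sketch below takes the cost and speed figures from the table as given (they are the article's numbers, not independent measurements). `cost_per_tps` is a hypothetical helper name.

```bash
# Dollars of GPU hardware per tok/sec of throughput.
# Usage: cost_per_tps <cost_usd> <tok_per_sec>
cost_per_tps() {
  awk -v c="$1" -v s="$2" 'BEGIN { printf "%.2f\n", c / s }'
}

cost_per_tps 3600 100   # 2x RTX 4090, 70B Q4 -> 36.00
cost_per_tps 3600 90    # 2x RTX 4090, 70B Q5 -> 40.00
cost_per_tps 4000 120   # 2x RTX 5090, 70B Q8 -> 33.33
```

This ignores power, motherboard, and PSU costs, so treat it as a first-pass comparison rather than a total cost of ownership.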

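📌 Key Point: Two RTX 4090s deliver ~100 tok/sec on 70B models, roughly 90% of single-GPU speed due to 5–10% communication overhead. The RTX 5090 (32 GB GDDR7, launched January 2025) changed the equation: the table lists a single 5090 at 40–50 tok/sec on 70B Q4 without splitting, although ~35 GB of Q4 weights exceeds 32 GB of VRAM, so expect a lower-bit quantization or partial offload. Dual 5090s (64 GB combined) are listed as the first consumer setup to handle 405B Q4 models, which at roughly 200 GB of weights depend on offloading beyond VRAM.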

When to Use Multi-GPU?

Multi-GPU is cost-effective when you need 70B+ models or high-concurrency services. Use multiple GPUs when:

  • You need to run 70B+ models.
  • You serve 50+ concurrent users (batch processing).
  • You want to run multiple 13B models simultaneously.
  • You run production services (not experimentation).
Multi-GPU decision matrix: use if running 70B+ models, serving 50+ concurrent users, or needing 100+ tok/sec for production; skip if not yet purchased 2nd GPU or doing experimentation.

💡 Pro Tip: For experimentation with 70B models, try CPU offloading on a single GPU first (8–10 tok/sec on an RTX 4090). Once production demand is confirmed, invest in a second RTX 4090 for a multi-GPU setup (100 tok/sec).

Common Multi-GPU Mistakes

  • Expecting a 2× speedup with 2 GPUs. A second GPU adds VRAM, not speed: you get ~90% of the speed of a single GPU with the same total VRAM (5–10% overhead from GPU communication).
  • Assuming GPUs must be identical. You can mix an RTX 4090 with an RTX 4080, but vLLM will be limited by the smaller, slower GPU.
  • Ignoring the lack of NVLink. Without NVLink, multi-GPU communication is slower, and current consumer RTX cards do not offer it, so budget for PCIe overhead.
  • Forgetting about PCIe bandwidth. GPU-to-GPU communication goes through PCIe, which limits bandwidth (~32 GB/sec per direction, ~64 GB/sec bidirectional, on a PCIe 4.0 x16 slot).
  • Buying a second GPU before trying single-GPU options. Before investing $1,800+ in a second RTX 4090, try: (1) Q4 quantization instead of Q5/Q8 (roughly halves VRAM versus Q8), (2) CPU offloading via Ollama (8–10 tok/sec for 70B on a single 4090), (3) a single RTX 5090 with 32 GB (listed at 40–50 tok/sec on 70B Q4 for $2,000). Multi-GPU should be the last optimization, not the first.

⚠️ Warning: Matching GPU models is strongly recommended for consistent performance. Mismatched GPUs (e.g., 4090 + 4080) create bottlenecks where the slower card dictates system speed. In production, always pair identical GPUs.

Frequently Asked Questions

💬 Did You Know? NVLink bandwidth (up to 900 GB/sec on H100) vs PCIe bandwidth (64 GB/sec) is the hidden factor in multi-GPU performance. A100/H100 professional GPUs with NVLink can achieve near-linear scaling (e.g., close to a 2× speedup with 2 GPUs). Consumer RTX cards are limited to PCIe, causing 5–10% overhead.

When should I use multiple GPUs for local LLMs?

Use multiple GPUs when a single GPU lacks the VRAM for your target model. Two RTX 4090s (48 GB combined) run 70B models at Q4 quantization at ~100 tokens/sec. A single GPU with CPU offloading achieves only 8–10 tokens/sec for the same model. Multi-GPU is cost-effective for 70B+ models when you already have or can acquire a second GPU.

How does vLLM tensor parallelism work across GPUs?

vLLM uses tensor parallelism (`--tensor-parallel-size 2`), which shards each layer's weight matrices across GPUs rather than assigning whole layers to each card. Each GPU holds half of the model's weight matrices; computations happen in parallel, and partial results are exchanged via NVLink or PCIe. NVLink (NVLink 4.0: 900 GB/sec bidirectional) is significantly faster than PCIe (64 GB/sec) for inter-GPU communication.

Does NVLink make a significant difference for LLM inference?

NVLink improves throughput by 10–30% vs PCIe for large models requiring frequent GPU-to-GPU communication. For 70B models split across two GPUs, NVLink reduces communication overhead from 5–10% to ~3–5%. Consumer RTX cards use PCIe; NVLink is available on professional A100/H100 GPUs. For home use, PCIe is sufficient.

Can I mix different GPU models (e.g., RTX 4090 + RTX 4080) for layer splitting?

Technically yes: vLLM and llama.cpp support mixed GPU setups. In practice, the slower GPU bottlenecks the pair. A 4090+4080 pair performs closer to two 4080s than two 4090s. Matching GPU models is strongly recommended for production deployments.

How many GPUs do I need for 70B and 405B models?

70B at Q4: fits in 2× RTX 4090 (35 GB needed, 48 GB available). 70B at Q8: needs 4× RTX 4090 (70 GB needed, 96 GB available). 405B at Q4: needs about 200 GB, which exceeds even 4× RTX 4090 (96 GB combined). For 405B, professional A100 80GB×4 (320 GB combined) is the recommended platform.
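
The VRAM figures in this answer follow from parameter count times bits per weight. The sketch below reproduces them under a simplifying assumption: it counts weights only and ignores KV cache, activations, and runtime overhead, so real requirements are somewhat higher. `weights_gb` and `gpus_needed` are hypothetical helper names.

```bash
# Approximate weight size in GB: billions of parameters * bits per weight / 8.
# Usage: weights_gb <params_in_billions> <bits_per_weight>
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Minimum number of cards whose combined VRAM covers the requirement.
# Usage: gpus_needed <needed_gb> <vram_per_gpu_gb>
gpus_needed() {
  awk -v need="$1" -v per="$2" \
    'BEGIN { n = int(need / per); if (n * per < need) n++; print n }'
}

weights_gb 70 4     # 35.0  (70B at Q4)
weights_gb 70 8     # 70.0  (70B at Q8)
weights_gb 405 4    # 202.5 (405B at Q4)

gpus_needed 35 24     # 2 cards with 24 GB each
gpus_needed 202.5 24  # 9 cards with 24 GB each
gpus_needed 202.5 80  # 3 cards with 80 GB each
```

These are lower bounds on capacity alone. Tensor parallelism usually wants a GPU count that divides the model evenly, which is one reason to round up to 2, 4, or 8 cards.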

What is the speed penalty for layer splitting vs a single GPU?

Layer splitting adds 5–10% overhead from inter-GPU communication. Two RTX 4090s running a 70B model achieve ~100 tokens/sec, roughly 90% of what a single theoretical 48 GB GPU would achieve. This is far better than CPU offloading (8–10 tokens/sec), and a single 4090 cannot hold a 70B model in VRAM at all.

Can I run 70B on a single RTX 5090 without multi-GPU?

Yes, with caveats. The RTX 5090 (32 GB GDDR7, January 2025) is a tight fit for Llama 3.3 70B at Q4_K_M, which needs ~40 GB with KV cache even at short context, so running it on 32 GB means a lower-bit quantization or offloading some layers to the CPU. Listed performance: 40–50 tok/sec. For 70B at longer context (32K+) or higher quantization (Q5+), dual GPUs are still needed. For 70B Q4 at short context, the 5090 reduces the need for multi-GPU.
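
Context length matters because the KV cache grows linearly with it. As a rough sketch, the helper below estimates KV cache size assuming the Llama 3 70B shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and an FP16 cache at 2 bytes per value. `kv_cache_gb` is a hypothetical name, and runtimes that quantize the cache will use less.

```bash
# KV cache size in GB: 2 (keys and values) * layers * kv_heads * head_dim
# * bytes per value * context length in tokens.
# Usage: kv_cache_gb <layers> <kv_heads> <head_dim> <bytes_per_value> <context>
kv_cache_gb() {
  awk -v L="$1" -v kv="$2" -v d="$3" -v b="$4" -v ctx="$5" \
    'BEGIN { printf "%.2f\n", 2 * L * kv * d * b * ctx / 1e9 }'
}

kv_cache_gb 80 8 128 2 4096    # 4K context  -> 1.34
kv_cache_gb 80 8 128 2 32768   # 32K context -> 10.74
```

Under these assumptions a 32K context adds roughly 10 GB on top of the weights for a single sequence, which is why long-context 70B workloads push past a single 32 GB card.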

Is PCIe 5.0 worth it for multi-GPU LLM setups?

PCIe 5.0 doubles bandwidth to ~128 GB/sec vs 64 GB/sec on PCIe 4.0 (bidirectional, x16). For dual-GPU 70B inference, this reduces communication overhead from ~10% to ~6–7%. The improvement is noticeable but not transformative: NVLink (900 GB/sec) remains the only way to achieve near-linear scaling. For consumer builds, PCIe 5.0 motherboards are recommended if buying new, but upgrading from PCIe 4.0 solely for multi-GPU is not cost-effective.

A Note on Third-Party Facts

This article references third-party AI models, benchmarks, prices, and licenses. The AI landscape changes rapidly. Benchmark scores, license terms, model names, and API prices can shift between the time of writing and the time you read this. Before making deployment or compliance decisions based on this article, verify current figures on each provider's official source: Hugging Face model cards for licenses and benchmarks, provider websites for API pricing, and EUR-Lex for current GDPR and EU AI Act text. This article reflects publicly available information as of May 2026.
