RTX 5090 vs RTX 4090 for Local LLM Inference

Last updated: June 2026 · 6 min read · By Hans Kuepper, founder of PromptQuorum, a multi-model AI dispatch tool

For local LLMs, the RTX 5090 is 30-50% faster than the RTX 4090 on 70B models and has 32GB of GDDR7 versus the 4090's 24GB of GDDR6X, but it costs about $1,000 more. As of June 2026, the choice comes down to whether you need faster 70B inference or the extra 8GB of VRAM to fit larger quantizations (the 5090 wins) or are happy with 7B-34B models (the 4090 is sufficient). If you already own a 4090, upgrading is hard to justify. If you are buying new, the RTX 5080 offers better performance per dollar.

Key Takeaways

The RTX 5090 is ~30-50% faster than the RTX 4090 for local LLM inference on large models (70B Q4). On 7B-13B models the gap is smaller (~15-20%).
The RTX 5090 has 32GB GDDR7 vs. the RTX 4090's 24GB GDDR6X. The extra 8GB matters for 34B Q8 (~28GB, too large for 24GB) and for long-context runs.
The RTX 5090 costs about $1,000 more ($1,999 new vs. $999 for a used 4090). The speed gain does not justify upgrading if you already have a 4090.
For 7B-13B models, the 4090 is already overkill. You will hit CPU or cooling limits before maxing out the GPU.
For 70B models, the 5090 shines: this is where the speed gain is largest, and the extra VRAM leaves room for larger batch sizes or for running 2-3 smaller models in parallel.
The RTX 5080 ($999) often provides better value than the 5090 for local LLMs unless you need dual-GPU setups.

What Are the Raw Speed Differences?

RTX 5090: 21,760 CUDA cores, ~105 TFLOPS (FP32), ~1,792 GB/s memory bandwidth, 32GB GDDR7.

RTX 4090: 16,384 CUDA cores, ~83 TFLOPS (FP32), ~1,008 GB/s memory bandwidth, 24GB GDDR6X.

Real-world LLM inference (Llama 3.3 70B, Q4, batch=1): the RTX 5090 reaches ~50-55 tokens/sec and the RTX 4090 ~36 tokens/sec, so the 5090 is ~40-50% faster on 70B models.

For 7B models (still memory-bound, but with far smaller weights): the RTX 5090 reaches ~90 tokens/sec and the RTX 4090 ~75 tokens/sec, about 20% faster. The gap narrows as models get smaller.
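As a rough sanity check on these figures: single-stream token generation is usually limited by memory bandwidth, so the bandwidth ratio between the two cards gives a loose ceiling on the speedup. The sketch below uses only numbers quoted above (52 tokens/sec is taken as the midpoint of the 5090's 50-55 range); the helper name `percent_faster` is made up for illustration.

```python
# Figures quoted in this article; treat them as approximate.
BANDWIDTH_GBPS = {"RTX 5090": 1792, "RTX 4090": 1008}
TOKENS_PER_SEC = {
    "70B Q4": {"RTX 5090": 52, "RTX 4090": 36},
    "7B": {"RTX 5090": 90, "RTX 4090": 75},
}


def percent_faster(new: float, old: float) -> float:
    """Return how much faster `new` is than `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Bandwidth ratio: a loose upper bound for memory-bound generation.
ceiling = percent_faster(BANDWIDTH_GBPS["RTX 5090"], BANDWIDTH_GBPS["RTX 4090"])
print(f"Bandwidth ceiling: {ceiling:.0f}% faster")

# Observed gains, computed from the quoted tokens/sec figures.
for model, speeds in TOKENS_PER_SEC.items():
    gain = percent_faster(speeds["RTX 5090"], speeds["RTX 4090"])
    print(f"{model}: {gain:.0f}% faster")
```

The quoted real-world gains (about 44% and 20%) sit well below the ~78% bandwidth ceiling, which fits the pattern that other limits take over as models get smaller.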

Does VRAM Matter Between 4090 and 5090?

RTX 5090 has 32GB GDDR7. RTX 4090 has 24GB GDDR6X. The 8GB difference matters for specific model sizes.

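The VRAM advantage of the 5090 is real and meaningful: 34B Q8 (~28GB) fits in 32GB but exceeds 24GB, so on a 4090 it only runs with part of the model offloaded. 70B Q4 (~38-40GB) does NOT fit on either card alone. To hold it entirely in VRAM you need a dual-GPU setup or Apple Silicon; on a single card, part of the model has to be offloaded to system RAM. For 7B-13B models, 24GB is more than enough.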
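Taking the footprints quoted above at face value, the fit question reduces to a simple comparison. This is a minimal sketch: the 2GB headroom for KV cache and runtime buffers is an assumption (it grows with context length), 39GB is used as the midpoint of the 38-40GB range, and `fits` is a hypothetical helper.

```python
# Approximate footprints quoted in this article, in GB.
MODEL_GB = {"34B Q8": 28, "70B Q4": 39}
CARD_VRAM_GB = {"RTX 4090": 24, "RTX 5090": 32}

# Assumed allowance for KV cache and runtime buffers (not from the article).
HEADROOM_GB = 2


def fits(model_gb: float, vram_gb: float, headroom_gb: float = HEADROOM_GB) -> bool:
    """True if the weights plus headroom fit entirely in VRAM."""
    return model_gb + headroom_gb <= vram_gb


for model, size in MODEL_GB.items():
    for card, vram in CARD_VRAM_GB.items():
        verdict = "fits" if fits(size, vram) else "needs offload or a second GPU"
        print(f"{model} on {card}: {verdict}")
```

With these inputs only 34B Q8 on the 5090 fits; raising the headroom to model a long context shrinks that margin quickly.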

Cost Per Token: Which Is Actually Cheaper?

Used RTX 4090: ~$999-1,299. Achieves ~36 tokens/sec on Llama 3.3 70B Q4. Purchase cost per unit of throughput: ~$28-36 per token/sec.
RTX 5090 new: $1,999. Achieves ~52 tokens/sec on Llama 3.3 70B Q4. Purchase cost per unit of throughput: ~$38 per token/sec.
Verdict: the 4090 costs less per unit of throughput, not because it is faster, but because it is cheaper to buy. These figures divide purchase price by generation speed and ignore electricity.
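The dollar figures in this section are consistent with dividing purchase price by sustained tokens/sec, which gives dollars per unit of throughput. A minimal sketch of that arithmetic, using the prices and speeds quoted above (the function name is hypothetical, and electricity is ignored):

```python
def dollars_per_tok_per_sec(price_usd: float, tokens_per_sec: float) -> float:
    """Purchase price divided by generation speed."""
    return price_usd / tokens_per_sec


# (price in USD, tokens/sec on 70B Q4), as quoted in this article.
options = {
    "Used RTX 4090 (low price)": (999, 36),
    "Used RTX 4090 (high price)": (1299, 36),
    "New RTX 5090": (1999, 52),
}

for name, (price, speed) in options.items():
    cost = dollars_per_tok_per_sec(price, speed)
    print(f"{name}: ${cost:.2f} per token/sec")
```

Swapping in current street prices is the quickest way to see whether the verdict still holds when you read this.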

When Should You Actually Upgrade from 4090 to 5090?

Never upgrade for 7B-13B inference. The 4090 is already overkill for these models; you will be CPU-bound or cooling-limited anyway.

Upgrade if: You regularly run 70B Q4 and need >40 tok/s, or you need 34B Q8 (~28GB VRAM), which fits in the 5090's 32GB but not in the 4090's 24GB.

Better alternative: Add a second RTX 4090 for $1,200 instead of trading up to 5090. Two 4090s in tensor-parallel give you ~65-72 tokens/sec on 70B Q4, at a lower total cost than 2× 5090 ($4,000). Trade-off: more setup complexity.
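The advice in this section can be written down as a small decision helper for someone who already owns a 4090. This is only an illustrative sketch: the function `recommend`, its model labels, and its return strings are invented here, and the 40 tok/s threshold is the one quoted above.

```python
def recommend(model: str, min_tokens_per_sec: float = 0.0) -> str:
    """Map a workload to this article's advice for a current 4090 owner."""
    if model in {"7B", "13B"}:
        # Small models: the 4090 is already more than enough.
        return "keep the 4090"
    if model == "34B Q8":
        # ~28GB of weights needs the 5090's 32GB.
        return "upgrade to a 5090"
    if model == "70B Q4" and min_tokens_per_sec > 40:
        # The 4090 is quoted at ~36 tok/s on 70B Q4.
        return "upgrade to a 5090 or add a second 4090"
    return "keep the 4090"


print(recommend("13B"))
print(recommend("34B Q8"))
print(recommend("70B Q4", min_tokens_per_sec=50))
print(recommend("70B Q4", min_tokens_per_sec=30))
```

The third case returns both options on purpose: choosing between a 5090 and a second 4090 depends on budget and how much setup complexity you will tolerate.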

Common Assumptions About the 5090

Thinking the 5090 is 2× faster than the 4090: it is 40-50% faster on 70B and only ~20% faster on 7B.
Thinking both cards have the same VRAM: they don't. The 5090 has 32GB, the 4090 has 24GB. The 8GB difference decides whether a 34B Q8 model (~28GB) fits.
Believing you need a 5090 to run 70B models: you don't. The 4090 runs Llama 3.3 70B Q4 at ~36 tokens/sec. That's sufficient for most local inference tasks.

Frequently Asked Questions

Is RTX 5090 worth it for running Llama 3.3 70B?

Depends. The 4090 gives you ~36 tok/s on 70B Q4, which is usable. The 5090 gives ~52 tok/s, about 44% faster. The extra speed costs about $1,000. Worth it if you run 70B continuously; not worth it for occasional use.

Should I buy RTX 5090 or two RTX 4090s?

Two 4090s (~$2,500 used) beat one 5090 ($1,999) on raw 70B speed (up to ~72 tok/s combined). But two-GPU setups add complexity: tensor parallelism, driver overhead, and physical space. The 5090 is simpler and offers 32GB of VRAM on a single card for 34B Q8 use cases.

Does RTX 5090 have better VRAM than 4090?

Yes. The RTX 5090 has 32GB GDDR7; the RTX 4090 has 24GB GDDR6X. The extra 8GB lets the 5090 hold 34B Q8 models (~28GB needed) entirely in VRAM. Neither card can fit 70B Q4 (~38-40GB) without splitting it across GPUs or offloading part of it.

Is the RTX 5090 overkill for a 14B model?

Yes, for most use cases. A 14B Q4 model needs ~9GB VRAM and runs at ~55-65 tok/s on a 4090 — already fast. The 5090 would give ~65-75 tok/s, a marginal improvement. The 4090 or even a 3090 is more cost-effective for 14B inference.

Does the RTX 5090 laptop GPU have 32GB VRAM?

No. The RTX 5090 Laptop GPU has 24GB GDDR7 (reduced from the desktop's 32GB due to power and thermal constraints). It's significantly slower than the desktop 5090 but still faster than a 4090 on the same task.

Will 5090 prices drop like 4090 did?

Yes, eventually. The 4090 launched at $1,599 (2022) and now sells for about $999 used (2026). Expect the 5090 to hit $1,200-1,500 used in 2-3 years.

Can I use RTX 5090 with a 750W power supply?

Barely. The RTX 5090 alone draws up to 575W, which leaves only 175W of a 750W supply for the rest of the system. Pair it with an 850W or 1000W PSU to avoid voltage sag under load.
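The headroom behind that answer is simple subtraction. A minimal sketch using the 575W figure quoted above; `remaining_watts` is a hypothetical helper, and it ignores PSU efficiency and transient power spikes:

```python
GPU_WATTS = 575  # RTX 5090 power draw quoted in this article


def remaining_watts(psu_watts: int, gpu_watts: int = GPU_WATTS) -> int:
    """Watts left for the CPU, drives, and fans after the GPU's share."""
    return psu_watts - gpu_watts


for psu in (750, 850, 1000):
    print(f"{psu}W PSU leaves {remaining_watts(psu)}W for the rest of the system")
```

Compare the remaining wattage against your CPU's rated power draw before settling on a PSU size.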

Is RTX 5080 a better value than 5090?

Yes, for most. The 5080 ($999) delivers about 80% of the 5090's speed at half the cost. For local LLMs, the 5080 is the sweet spot.

How much faster is 5090 on multimodal models like Qwen-VL 70B?

Expect a lift of roughly 20-25%. Multimodal inference is still memory-bound, so the 5090's bandwidth advantage helps, but not dramatically.

Sources

NVIDIA RTX 5090 and 4090 official specifications: CUDA cores, TFLOPS, memory bandwidth
MLCommons MLPerf Inference Benchmark: Token generation speed on LLaMA 70B and Mistral models
TechPowerUp GPU Database: RTX 5090 vs. 4090 power consumption and memory bandwidth comparison

A Note on Third-Party Facts

This article references third-party AI models, benchmarks, prices, and licenses. The AI landscape changes rapidly. Benchmark scores, license terms, model names, and API prices can shift between the time of writing and the time you read this. Before making deployment or compliance decisions based on this article, verify current figures on each provider’s official source: Hugging Face model cards for licenses and benchmarks, provider websites for API pricing, and EUR-Lex for current GDPR and EU AI Act text. This article reflects publicly available information as of May 2026.
