Key Takeaways
- RTX 5090 is ~30-50% faster than RTX 4090 for local LLM inference on large models (70B Q4). On 7B-13B models the gap is smaller (~10-15%).
- RTX 5090 has 32GB GDDR7 vs RTX 4090's 24GB GDDR6X — 8GB extra VRAM matters for 34B Q8 and large-context 70B Q4 runs.
- RTX 5090 costs $1,000 more ($1,999 vs. $999 for used 4090). The price-to-performance gain doesn't justify upgrading if you already have a 4090.
- For 7B-13B models: 4090 is overkill. You'll hit CPU/cooling limits before maxing GPU.
- For 70B models: 5090 shines. Can run 2-3 smaller 70B models in parallel or single 70B at higher batch sizes.
- RTX 5080 ($999) often provides better value than 5090 for local LLMs unless you need dual-GPU setups.
What Are the Raw Speed Differences?
RTX 5090: 21,760 CUDA cores, 1,457 TFLOPS, ~1,792 GB/s memory bandwidth, 32GB GDDR7.
RTX 4090: 16,384 CUDA cores, 826 TFLOPS, ~1,008 GB/s memory bandwidth, 24GB GDDR6X.
Real-world LLM inference (Llama 3.3 70B, Q4, batch=1): RTX 5090 scores ~50-55 tokens/sec, RTX 4090 scores ~36 tokens/sec. ~40-50% faster on 70B models.
For 7B models (memory-bound but smaller weights): RTX 5090 scores ~90 tokens/sec, RTX 4090 scores ~75 tokens/sec. ~20% faster. The gap is smaller at smaller model sizes.
Does VRAM Matter Between 4090 and 5090?
RTX 5090 has 32GB GDDR7. RTX 4090 has 24GB GDDR6X. The 8GB difference matters for specific model sizes.
The VRAM advantage of the 5090 is real and meaningful: 34B Q8 (~28GB) fits comfortably in 32GB but is marginal in 24GB. 70B Q4 (~38-40GB) does NOT fit on either card alone — you need dual-GPU or Apple Silicon for that. For 7B-13B models, 24GB is more than enough.
Cost Per Token: Which Is Actually Cheaper?
- Used RTX 4090: ~$999-1,299. Achieves 36 tokens/sec on Llama 70B. Cost per token: $27-36 per M tokens.
- RTX 5090 new: $1,999. Achieves ~52 tokens/sec on Llama 3.3 70B Q4. Cost per token: ~$38 per M tokens.
- Verdict: 4090 is cheaper per token generated, not because it's faster, but because it's cheaper to buy.
When Should You Actually Upgrade from 4090 to 5090?
Never upgrade for 7B-13B inference. 4090 is overkill for these. You'll be CPU-bound or cooling-limited anyway.
Upgrade if: You regularly run 70B Q4 and need >40 tok/s, or you need 34B Q8 (~28GB VRAM) which fits in 5090's 32GB but is marginal in 4090's 24GB.
Better alternative: Add a second RTX 4090 for $1,200 instead of trading up to 5090. Two 4090s in tensor-parallel give you ~65-72 tokens/sec on 70B Q4, at a lower total cost than 2× 5090 ($4,000). Trade-off: more setup complexity.
Common Assumptions About the 5090
- Thinking 5090 is 2× faster than 4090 — it's 40-50% faster on 70B, only ~20% faster on 7B.
- Thinking both cards have the same VRAM — they don't. 5090 has 32GB, 4090 has 24GB. The 8GB difference matters for 34B Q8 models.
- Believing you need 5090 to run 70B models — you don't. 4090 runs Llama 3.3 70B Q4 at ~36 tokens/sec. That's sufficient for most local inference tasks.
Frequently Asked Questions
Is RTX 5090 worth it for running Llama 3.3 70B?
Depends. 4090 gives you ~36 tok/s on 70B Q4 which is usable. 5090 gives ~52 tok/s — 40% faster. The extra speed costs $1,000. Worth it if you're running 70B continuously; not worth it for occasional use.
Should I buy RTX 5090 or two RTX 4090s?
Two 4090s (~$2,500 used) beat 5090 ($1,999) on raw 70B speed (~72 tok/s combined). But two-GPU setups have complexity: tensor parallelism, driver overhead, space. 5090 is simpler and adds 32GB unified VRAM for 34B Q8 use cases.
Does RTX 5090 have better VRAM than 4090?
Yes. RTX 5090 has 32GB GDDR7, RTX 4090 has 24GB GDDR6X. The 8GB extra lets 5090 run 34B Q8 models (~28GB needed) without squeezing. Both cards cannot fit 70B Q4 (~38-40GB) without splitting across GPUs.
Is the RTX 5090 overkill for a 14B model?
Yes, for most use cases. A 14B Q4 model needs ~9GB VRAM and runs at ~55-65 tok/s on a 4090 — already fast. The 5090 would give ~65-75 tok/s, a marginal improvement. The 4090 or even a 3090 is more cost-effective for 14B inference.
Does the RTX 5090 laptop GPU have 32GB VRAM?
No. The RTX 5090 Laptop GPU has 24GB GDDR7 (reduced from the desktop's 32GB due to power and thermal constraints). It's significantly slower than the desktop 5090 but still faster than a 4090 on the same task.
Will 5090 prices drop like 4090 did?
Yes, eventually. 4090 was $1,499 at launch (2022), now $999 used (2026). Expect 5090 to hit $1,200-1,500 used in 2-3 years.
Can I use RTX 5090 with a 750W power supply?
Barely. RTX 5090 draws 575W alone. Pair with a 850W or 1000W PSU to avoid voltage sag under load.
Is RTX 5080 a better value than 5090?
Yes, for most. 5080 ($999) is 80% of 5090's speed at half the cost. For local LLMs, 5080 is the sweet spot.
How much faster is 5090 on multimodal models like Qwen-VL 70B?
Similar 20-25% lift. Multimodal compute is still memory-bound, so the bandwidth advantage of 5090 helps, but not dramatically.
Sources
- NVIDIA RTX 5090 and 4090 official specifications: CUDA cores, TFLOPS, memory bandwidth
- MLCommons MLPerf Inference Benchmark: Token generation speed on LLaMA 70B and Mistral models
- TechPowerUp GPU Database: RTX 5090 vs. 4090 power consumption and memory bandwidth comparison