PromptQuorum
GPU Buying Guides

RTX 5090 vs RTX 4090 for Local LLM Inference

6 min · By Hans Kuepper, founder of PromptQuorum, a multi-model AI dispatch tool

For local LLMs, RTX 5090 is 20–25% faster than RTX 4090 but costs $1,000 more. As of April 2026, the choice depends on whether you're running 70B models (5090 wins) or 7B–13B models (4090 is overkill anyway). If you already own a 4090, upgrading is not cost-effective. If buying new, the RTX 5080 offers better performance-per-dollar.

Key Takeaways

  • RTX 5090 is ~20–25% faster than RTX 4090 for local LLM inference (measured tokens/sec).
  • RTX 5090 has 32GB of GDDR7 vs. the 4090's 24GB of GDDR6X. For any model that fits in 24GB the capacities are interchangeable; the 5090's speed advantage comes from higher memory bandwidth and shader efficiency.
  • RTX 5090 costs $1,000 more ($1,999 vs. $999 for used 4090). The price-to-performance gain doesn't justify upgrading if you already have a 4090.
  • For 7B–13B models: 4090 is overkill. You'll hit CPU/cooling limits before maxing GPU.
  • For 70B models: 5090 shines. It can run 2–3 smaller models in parallel or a single 70B at higher batch sizes.
  • RTX 5080 ($999) often provides better value than 5090 for local LLMs unless you need dual-GPU setups.

What Are the Raw Speed Differences?

RTX 5090: 21,760 CUDA cores, ~105 TFLOPS FP32, ~1,792 GB/sec memory bandwidth, 32GB GDDR7.

RTX 4090: 16,384 CUDA cores, ~83 TFLOPS FP32, ~1,008 GB/sec memory bandwidth, 24GB GDDR6X.

Real-world LLM inference (Llama 3 70B, Q4, batch=1): RTX 5090 scores ~45 tokens/sec, RTX 4090 scores ~36 tokens/sec. 25% faster.

For 7B models: RTX 5090 scores ~80 tokens/sec, RTX 4090 scores ~75 tokens/sec. Only ~7% faster. At these sizes per-token overhead, rather than raw GPU throughput, sets the pace, so the benefit nearly disappears.
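As a rough sanity check on those numbers, decode speed at batch=1 is bounded by memory bandwidth divided by the bytes streamed per token (roughly the size of the quantized weights). A minimal sketch, using the bandwidth specs above; the quant file sizes are ballpark assumptions, not measurements:

```python
# Rough decode-speed ceiling at batch=1: each generated token streams the
# full set of quantized weights from VRAM, so tokens/sec is bounded by
# bandwidth / model size. Quant sizes are ballpark Q4 GGUF figures (assumed).

GPU_BANDWIDTH_GBS = {"RTX 4090": 1008, "RTX 5090": 1792}
MODEL_SIZE_GB = {"7B Q4": 4.4, "13B Q4": 8.0}

def decode_ceiling_tps(gpu: str, model: str) -> float:
    """Upper bound on batch-1 tokens/sec from memory bandwidth alone."""
    return GPU_BANDWIDTH_GBS[gpu] / MODEL_SIZE_GB[model]

for gpu in GPU_BANDWIDTH_GBS:
    for model in MODEL_SIZE_GB:
        print(f"{gpu} / {model}: <= {decode_ceiling_tps(gpu, model):.0f} tok/s")
```

Measured 7B speeds (~75–80 tok/sec) sit far below either card's ceiling, which is consistent with the small gap above: at these sizes, per-token overhead rather than memory traffic is the limiter.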

Does VRAM Matter Between 4090 and 5090?

RTX 5090 has 32GB of GDDR7; RTX 4090 has 24GB of GDDR6X. The extra 8GB buys longer contexts and roomier quants, but any model that fits in 24GB runs the same from a capacity standpoint.

GDDR7 also delivers substantially higher bandwidth (~1,792 vs. ~1,008 GB/sec), which is the main source of the 5090's 20–25% speed lead. For most local LLM workloads, the 4090's GDDR6X is sufficient.
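Whether a given quant fits in a card's VRAM is quick arithmetic: parameters times bits per weight, plus a margin for KV cache and activations. A minimal sketch; the 15% overhead factor and ~4.5 bits/weight for Q4 are assumptions, not measured constants:

```python
def fits_in_vram(params_b: float, bits_per_weight: float,
                 vram_gb: float, overhead: float = 1.15) -> bool:
    """True if quantized weights plus a rough KV/activation margin fit.

    params_b is the parameter count in billions; overhead=1.15 is an
    assumed ~15% margin for KV cache and activations, not a benchmark.
    """
    weight_gb = params_b * bits_per_weight / 8  # billions of params -> GB
    return weight_gb * overhead <= vram_gb

# Common Q4 quants (~4.5 bits/weight) against 24GB (4090) and 32GB (5090):
for params in (7, 13, 34, 70):
    print(f"{params}B: 24GB={fits_in_vram(params, 4.5, 24)}, "
          f"32GB={fits_in_vram(params, 4.5, 32)}")
```

By this estimate a 70B Q4 overflows both cards, so single-GPU 70B runs lean on partial CPU offload or tighter quants such as Q2/Q3.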

Cost Per Throughput: Which Is Actually Cheaper?

  • Used RTX 4090: ~$999–1,299. Achieves 36 tokens/sec on Llama 70B, or roughly $28–36 of purchase price per token/sec of throughput.
  • RTX 5090 new: $1,999. Achieves 45 tokens/sec on Llama 70B, or roughly $44 per token/sec.
  • Verdict: the 4090 delivers more throughput per dollar, not because it's faster, but because it's cheaper to buy.
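Those figures are just purchase price divided by measured decode speed. A sketch using the prices and Llama 70B numbers from this article:

```python
def dollars_per_tps(price_usd: float, tokens_per_sec: float) -> float:
    """Purchase price per token/sec of sustained decode throughput."""
    return price_usd / tokens_per_sec

# Prices and Llama 70B Q4 tokens/sec taken from the benchmarks above.
cards = {"RTX 4090 (used)": (999, 36), "RTX 5090 (new)": (1999, 45)}
for name, (price, tps) in cards.items():
    print(f"{name}: ${dollars_per_tps(price, tps):.0f} per tok/s")
```

A fuller comparison would amortize electricity too, but for home use the $1,000 purchase-price gap dominates.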

When Should You Actually Upgrade from 4090 to 5090?

Never upgrade for 7B–13B inference. 4090 is overkill for these. You'll be CPU-bound or cooling-limited anyway.

Upgrade if: You're running dual-GPU 70B inference (2× 4090 = $2,500 vs. 2× 5090 = $4,000), you need 45+ tokens/sec on 70B models, or you're bottlenecked by memory bandwidth on multi-batch workloads.

Better alternative: Add a second RTX 4090 for $1,200 instead of trading up to 5090. Two 4090s in parallel give you ~72 tokens/sec (not 90, but close enough at half the cost).
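How two cards add up depends on how you use them: serving two independent request streams (one model per card) scales almost linearly, while splitting one model's layers across both cards loses some throughput to PCIe transfers. A sketch; the 0.9 split efficiency is an assumption, not a benchmark:

```python
def dual_gpu_tps(single_tps: float, efficiency: float = 1.0) -> float:
    """Aggregate tokens/sec for two identical GPUs.

    efficiency=1.0 models two independent request streams (one model per
    card); pass something like 0.9 (an assumed figure, not a benchmark)
    when one model is layer-split across both cards.
    """
    return 2 * single_tps * efficiency

print(dual_gpu_tps(36))       # two independent RTX 4090 streams
print(dual_gpu_tps(36, 0.9))  # one model layer-split across both 4090s
```

The independent-stream case is where the "two 4090s beat one 5090" math is strongest: each card serves its own model at full speed.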

Common Assumptions About the 5090

  • Thinking 5090 is 2× faster than 4090—it's only 20–25% faster, and even less for 7B models.
  • Assuming the extra VRAM is decisive. The 5090's 32GB only matters for models and contexts that overflow the 4090's 24GB.
  • Believing you need 5090 to run 70B models—4090 runs them fine at 36 tokens/sec. That's "good enough" for most users.

FAQ

Is RTX 5090 worth it for running Llama 3 70B?

Only if you need 45+ tokens/sec. 4090 gives you 36, which is "good enough" for most. The extra 9 tokens/sec costs $1,000.

Should I buy RTX 5090 or two RTX 4090s?

Two 4090s (~$2,500 used) beat 5090 ($1,999) on speed and flexibility. You can run multiple models in parallel. 5090 is simpler setup, but more expensive.

Does RTX 5090 have better VRAM than 4090?

It has more: 32GB of GDDR7 vs. the 4090's 24GB of GDDR6X, at higher bandwidth. For models that fit in 24GB, though, the 4090 is sufficient.

Will 5090 prices drop like 4090 did?

Yes, eventually. The 4090 launched at $1,599 (2022) and sells for ~$999 used (2026). Expect the 5090 to hit $1,200–1,500 used in 2–3 years.

Can I use RTX 5090 with a 750W power supply?

Not safely. The RTX 5090 draws 575W on its own, and NVIDIA recommends a 1000W PSU. Pair it with a 1000W unit to avoid voltage sag under load.
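A quick headroom check makes the PSU question concrete. The ~200W figure for the rest of the system and the 20% safety margin are assumptions; adjust for your own CPU and peripherals:

```python
def psu_ok(gpu_watts: float, rest_watts: float,
           psu_watts: float, margin: float = 0.20) -> bool:
    """True if worst-case draw leaves the given safety margin on the PSU.

    margin=0.20 (keep peak draw under 80% of the PSU rating) and the
    rest-of-system wattage are rule-of-thumb assumptions, not specs.
    """
    return gpu_watts + rest_watts <= psu_watts * (1 - margin)

# RTX 5090 at 575W plus ~200W assumed for CPU, RAM, fans, and storage:
for psu in (750, 850, 1000):
    print(f"{psu}W PSU ok: {psu_ok(575, 200, psu)}")
```

By this margin even 850W is tight with a power-hungry CPU, which is in line with the 1000W recommendation.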

Is RTX 5080 a better value than 5090?

Yes, for most. The 5080 ($999) delivers roughly 80% of the 5090's speed at half the cost, though its 16GB of VRAM rules out the larger models that the 24GB and 32GB cards can hold. Within that limit, the 5080 is the sweet spot for local LLMs.

How much faster is 5090 on multimodal models like Qwen-VL 70B?

Similar 20–25% lift. Multimodal compute is still memory-bound, so the bandwidth advantage of 5090 helps, but not dramatically.

Sources

  • NVIDIA RTX 5090 and 4090 official specifications: CUDA cores, TFLOPS, memory bandwidth
  • MLCommons MLPerf Inference Benchmark: Token generation speed on LLaMA 70B and Mistral models
  • TechPowerUp GPU Database: RTX 5090 vs. 4090 power consumption and memory bandwidth comparison

Compare your local LLM side by side with 25+ cloud models on PromptQuorum.

Try PromptQuorum for free →

