GPU Buying Guides

How Much VRAM Do You Need for Local LLMs?

7 min · By Hans Kuepper · Founder of PromptQuorum, a multi-model AI orchestration tool · PromptQuorum

For 7B models, you need 8GB VRAM; for 13B–22B, 12–16GB; for 70B, 24GB minimum. As of April 2026, these numbers assume Q4 (4-bit) quantization. Full-precision (FP32) models need roughly 8× the VRAM of Q4 and are rarely practical on consumer GPUs. The formula: VRAM ≈ model size in billions × bytes per parameter (4 for FP32, 1 for Q8, 0.5 for Q4), plus 1–2GB of overhead.

Key Takeaways

  • 7B models: 8GB minimum (Q4), 10GB comfortable (Q5), 14GB for Q8.
  • 13B models: 10GB minimum (Q4), 12–14GB comfortable (Q5), 16GB for Q8.
  • 70B models: 24GB minimum (Q4), 32GB+ for Q5/Q8 or multi-user setup.
  • Quantization (Q4, Q5, Q8) reduces VRAM by 75–87.5% vs. full precision (FP32).
  • Always over-allocate by 1–2GB for overhead (KV cache, runtime, OS).
  • Batch size ≠ context length. A single inference uses the same VRAM whether the server's maximum batch size is 1 or 32; batching means serving multiple requests in parallel.
  • More VRAM doesn't speed up single-prompt inference (as long as the model fits). It only helps with multi-user/multi-request setups.

What Is the VRAM Formula for LLMs?

VRAM (GB) = (Model Size in Billions × 4 bytes × Quantization Factor)

- Model size: Number of parameters (7B, 13B, 70B, etc.)

- 4 bytes: FP32 precision (32 bits = 4 bytes)

- Quantization factor: 1.0 (FP32), 0.25 (Q8, 1 byte/param), 0.15625 (Q5), 0.125 (Q4, 0.5 bytes/param)

Example: Llama 3 70B, FP32, no quantization:

70 billion × 4 bytes = 280GB. Impractical.

Llama 3 70B, Q4 (4-bit) quantization:

70 billion × 4 bytes × 0.125 = 35GB of weights. In practice, tighter Q4 variants, a quantized KV cache, and offloading some layers to CPU RAM are what bring a 70B model within reach of a 24GB card.
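The formula can be sketched as a small calculator. The bytes-per-parameter values follow this article's per-quant figures (4 for FP32, 1 for Q8, 0.625 for Q5, 0.5 for Q4); real quantized files vary slightly by variant, so treat the results as estimates.

```python
# A small VRAM-for-weights calculator based on this article's
# bytes-per-parameter figures. Real quantized files vary slightly
# by variant, so these are estimates.
BYTES_PER_PARAM = {
    "fp32": 4.0,    # full precision
    "q8":   1.0,    # 8-bit
    "q5":   0.625,  # 5-bit
    "q4":   0.5,    # 4-bit
}

def weight_vram_gb(params_billions: float, quant: str) -> float:
    """Estimated GB needed for the model weights alone (no overhead)."""
    return params_billions * BYTES_PER_PARAM[quant]

print(weight_vram_gb(70, "fp32"))  # 280.0 -- impractical, as above
print(weight_vram_gb(70, "q4"))    # 35.0 GB of weights
print(weight_vram_gb(7,  "q4"))    # 3.5 GB of weights
```

Remember these figures cover weights only; add 1–2GB for KV cache and system overhead before picking a card.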

How Much VRAM Does Each Model Size Need?

Model Size | FP32 (No Quantization) | Q8 (8-bit) | Q5 (5-bit) | Q4 (4-bit) | Recommended GPU
7B | 28GB | 7GB | ~4.4GB | 3.5GB | 8GB card
13B | 52GB | 13GB | ~8.1GB | 6.5GB | 12–16GB card
70B | 280GB | 70GB | ~44GB | 35GB | 24GB+ card

(Weights only, from the formula above; add 1–2GB for overhead.)

How Does Quantization Reduce VRAM Requirements?

Quantization reduces the number of bits needed to represent each model parameter.

- FP32 (32-bit float): Full precision. 1 parameter = 4 bytes. No loss. Slowest.

- Q8 (8-bit): 1 parameter = 1 byte. Near-lossless (well under 1% quality loss). 75% VRAM savings.

- Q5 (5-bit): 1 parameter = 0.625 bytes. Small loss, roughly 1–2%. ~84% VRAM savings.

- Q4 (4-bit): 1 parameter = 0.5 bytes. Slightly larger loss, though usually hard to notice in chat use. 87.5% VRAM savings.

For most users, Q4 is the sweet spot: barely perceptible quality loss and an 87.5% smaller VRAM footprint.

As of April 2026, Q4 is standard. Q5 and Q8 are available if you have extra VRAM and want marginal quality gains.
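The savings percentages follow directly from bytes per parameter; a quick check using the figures above:

```python
# VRAM savings vs. FP32 (4 bytes/param), from the bytes-per-param
# figures listed in this section.
FP32_BYTES = 4.0
for name, bytes_per_param in [("Q8", 1.0), ("Q5", 0.625), ("Q4", 0.5)]:
    savings = (1 - bytes_per_param / FP32_BYTES) * 100
    print(f"{name}: {savings:.1f}% VRAM savings")
# Q8: 75.0% / Q5: 84.4% / Q4: 87.5%
```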

What About Batch Size and Multi-User Inference?

Batch size affects throughput (tokens per second), not single-inference latency.

A single user prompting "What is 2+2?" uses the same VRAM whether the server's maximum batch size is 1 or 32.

Batch size = 32 means processing 32 prompts in parallel. The model weights are loaded once and shared; each concurrent request adds its own KV cache and activation memory, so total VRAM grows with batch size, but far less than 32×.

For single-user (typical local LLM usage): Batch size = 1. VRAM is model size + 1–2GB overhead.

For a multi-user server: budget the weights once, plus a KV cache per concurrent request — not model VRAM × batch size. A 70B Q4 server at batch=4 needs the shared weights plus roughly 2GB of cache per request, not 4× the whole model.
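To see where the per-request memory goes, here is a rough KV-cache estimator. The architecture numbers in the example (80 layers, 8 KV heads, head dimension 128) are illustrative assumptions resembling a large grouped-query-attention model, not figures from this article:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2,
                batch: int = 1) -> float:
    """Keys + values stored for every layer, KV head, and token position.
    bytes_per_elem=2 assumes an FP16 cache."""
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len * batch
    return elems * bytes_per_elem / 1e9

# Illustrative (assumed) 70B-class GQA shape at an 8K context:
print(kv_cache_gb(80, 8, 128, 8192))           # ~2.7 GB for one request
print(kv_cache_gb(80, 8, 128, 8192, batch=4))  # ~10.7 GB for four requests
```

Note that the cache grows linearly with both context length and concurrent requests, while the weights stay constant — which is why multi-user servers need extra VRAM but a single chat session does not.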

Do You Need More VRAM Than the Model Size?

Yes. Beyond the model weights, add:

- KV cache (key-value cache for context): ~5–10% extra VRAM.

- Optimizer state (if fine-tuning): 2–4× model size (only relevant for training, not inference).

- System overhead (OS, drivers, Ollama/LM Studio runtime): ~1–2GB.

Rule: a model whose Q4 weights take 20GB + KV cache (2GB) + system (2GB) = ~24GB allocated.

Always buy GPUs with at least 1–2GB headroom above theoretical minimums.
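The rule above amounts to a one-line budget; a minimal sketch, using the article's default overhead figures:

```python
def vram_needed_gb(weights_gb: float, kv_gb: float = 2.0,
                   system_gb: float = 2.0) -> float:
    """Weights + KV cache + system overhead, per the rule above."""
    return weights_gb + kv_gb + system_gb

needed = vram_needed_gb(20)  # the 20GB-of-weights example -> 24.0
print(f"{needed} GB needed; pick a card with 1-2 GB above that")
```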

Common VRAM Misconceptions

  • More VRAM = faster inference. False. Once the model fits, extra VRAM doesn't add speed; memory bandwidth (GB/s) does, and that's fixed per GPU. (Too little VRAM is slow, though: spilling to system RAM cuts throughput sharply.)
  • Batch size = context length. False. Batch size = parallel requests. A single inference runs at batch=1 regardless of VRAM size.
  • You need 24GB for any 70B model. It depends on quantization: 24GB is a Q4 floor, while Q8 needs ~70GB.

FAQ

Can I run Mistral 7B on a 6GB GPU?

At Q4 with a short context it can just fit, but headroom is razor-thin and OOM errors are likely. Buy at least 8GB for a comfortable 7B experience.

How much VRAM do I need for fine-tuning a 7B model?

For LoRA: 12–16GB. Full fine-tuning: 28GB+. Fine-tuning requires optimizer state (2–4× model VRAM), not just inference.
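The gap between LoRA and full fine-tuning is mostly optimizer state: full fine-tuning keeps gradients and optimizer moments for every weight, while LoRA trains only small adapter matrices. A back-of-envelope sketch using the 2–4× rule above (the FP16 weight size and multipliers are rough assumptions, not exact requirements):

```python
def full_finetune_gb(params_billions: float, mult: float = 2.0) -> float:
    """Rough rule: full fine-tuning needs ~2-4x the VRAM of the FP16
    weights once gradients and optimizer state are counted."""
    weights_fp16_gb = params_billions * 2  # 2 bytes/param at FP16
    return weights_fp16_gb * mult

print(full_finetune_gb(7))          # 28.0 -- the "28GB+" floor for 7B
print(full_finetune_gb(7, mult=4))  # 56.0 at the pessimistic end
```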

Is 12GB enough for Llama 3 13B?

At Q4, yes barely. At Q5 or Q8, no. 12GB is cutting it close. 16GB is comfortable.

Do I need 24GB for a 70B model?

At Q4, yes — 24GB is the floor. Higher-precision quants (Q5, Q8) need 32GB or more for 70B.

Does increasing batch size reduce VRAM for single inference?

No. Single inference always uses batch=1 VRAM. Batch size only helps throughput (multi-user scenarios).

What's the best quantization for accuracy?

Q8 is near-lossless. Q5 loses roughly 1–2%. Q4 loses slightly more, but in ordinary chat use the difference is hard to notice, which is why Q4 is the sweet spot for most.

Can I offload some VRAM to CPU RAM?

Yes, via layer offloading: llama.cpp and Ollama can keep some layers on the GPU and run the rest from CPU RAM (llama.cpp's --n-gpu-layers flag controls the split). Throughput drops 30–50% or more, but it works.

Sources

  • NVIDIA CUDA memory architecture and shared memory model documentation
  • Ollama and LM Studio official documentation: model VRAM requirements and quantization specs
  • llama.cpp project GitHub: quantization levels (Q4, Q5, Q8) and memory calculations

Compare your local LLM side by side with 25+ cloud models using PromptQuorum.

Try PromptQuorum free →

