Hardware & Performance

How to Run 70B Models on 24GB VRAM: Advanced Techniques

10 min read · By Hans Kuepper, founder of PromptQuorum, a multi-model AI dispatch tool

Running a 70B model, which normally needs 40+ GB of VRAM even at 4-bit quantization, on a 24 GB card is possible with aggressive quantization (Q2–Q3) and layer offloading, but the result is slow (~3–5 tokens/sec). As of April 2026, this is impractical for real-time chat but viable for batch processing or experimentation.

Key Takeaways

  • Llama 3.1 70B at Q4 = 35 GB (too large for 24 GB). At Q3 = 26 GB (still too large). At Q2 = 17.5 GB (fits).
  • Trade-off: Q2 has noticeable quality loss, roughly 70% of FP16 quality.
  • Speed: 3–5 tokens/sec once layers are offloaded to system RAM (ultra-slow).
  • Better option: use a smaller model (Llama 3.1 8B at Q5), or buy a second GPU for layer splitting.
  • As of April 2026, this is a constraint workaround, not a recommended approach.

The Theoretical VRAM Math

Llama 3.1 70B at various quantizations:

| Quantization | Model Size | Fits 24 GB? |
|---|---|---|
| FP16 (baseline) | ~140 GB | No |
| Q8 (8-bit) | ~70 GB | No |
| Q5 (5-bit) | ~44 GB | No |
| Q4 (4-bit) | 35 GB | No (with offloading: maybe) |
| Q3 (3-bit) | 26 GB | No (barely) |
| Q2 (2-bit) | 17.5 GB | Yes |
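The sizes above follow from simple arithmetic: parameters × bits per weight ÷ 8. A minimal sketch of that estimate (naive on purpose; real GGUF files carry extra scale metadata, so they run somewhat larger):

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Naive weight-only size estimate: params * bits / 8.
    Real GGUF quants store scale/zero-point data on top of this."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("Q8", 8), ("Q5", 5), ("Q4", 4),
                   ("Q3", 3), ("Q2", 2)]:
    size = model_size_gb(70, bits)
    verdict = "fits" if size <= 24 else "too large"
    print(f"{name:4s} ~{size:5.1f} GB -> {verdict} for 24 GB")
```

Only Q2 (17.5 GB) lands under the 24 GB line; Q3 comes out at ~26 GB, matching the table.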

Aggressive Quantization: The Primary Tool

To fit 70B in 24GB, you must use Q2 or Q3 quantization.

- Q3: 26 GB (still 2 GB over). Can offload 2 GB to RAM. Slightly better quality than Q2.

- Q2: 17.5 GB (fits!). 70% quality vs FP16. Noticeable degradation but usable.

Download the quantized model: `ollama pull llama3.1:70b-q2` (if available) or use conversion tools like llama.cpp.

Offloading to System RAM

If you run Q4 (35 GB) on a 24 GB GPU, you can offload the remaining 11 GB to system RAM. The speed penalty is severe (roughly 10× slower).

Only practical for batch processing where you can wait hours for results.
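The size of that penalty can be roughed out with a simple bandwidth model: each generated token streams every weight once, VRAM-resident layers at GPU bandwidth and offloaded layers at a much lower effective CPU-side rate. A sketch, where both bandwidth figures are assumptions to tune for your hardware:

```python
def tokens_per_sec(model_gb: float, vram_gb: float,
                   gpu_bw: float = 1000.0, cpu_bw: float = 30.0) -> float:
    """Crude decode-speed model. gpu_bw ~ RTX 4090 memory bandwidth
    (GB/s); cpu_bw is an assumed effective rate for offloaded layers
    (RAM bandwidth plus CPU compute overhead)."""
    on_gpu = min(model_gb, vram_gb)
    off_gpu = max(model_gb - vram_gb, 0.0)
    return 1.0 / (on_gpu / gpu_bw + off_gpu / cpu_bw)

full = tokens_per_sec(35.0, 35.0)   # hypothetical: Q4 fully in VRAM
split = tokens_per_sec(35.0, 24.0)  # Q4 with 11 GB offloaded
print(f"offload slowdown: ~{full / split:.0f}x")  # ~11x
```

Even with only a third of the weights offloaded, the slow path dominates total per-token time, which is why the penalty is roughly 10× rather than proportional to the offloaded fraction.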

Practical Setup: Running 70B on 24GB

Step-by-step:

  1. Use Q2 quantization: `ollama pull llama3.1:70b-q2` (if available; otherwise convert with llama.cpp)
  2. Verify VRAM usage: `nvidia-smi` should show ~18 GB used
  3. Run the model: `ollama run llama3.1:70b-q2`
  4. Expect 3–5 tokens/sec (very slow)
  5. Use it only for batch/offline processing, not interactive chat
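Before downloading tens of gigabytes, you can encode the fits-or-doesn't decision from the steps above. The sizes are this article's estimates, and the 2 GB headroom for KV cache and CUDA context is an assumption (long contexts need more):

```python
# Quant -> approximate weight size in GB, ordered best to worst quality.
QUANTS = {"Q8": 70.0, "Q5": 44.0, "Q4": 35.0, "Q3": 26.0, "Q2": 17.5}

def best_quant(vram_gb: float, headroom_gb: float = 2.0):
    """Pick the most accurate quant whose weights fit in VRAM,
    reserving headroom for KV cache and runtime overhead."""
    for name, size_gb in QUANTS.items():
        if size_gb + headroom_gb <= vram_gb:
            return name, size_gb
    return None  # nothing fits without offloading

print(best_quant(24))  # -> ('Q2', 17.5)
print(best_quant(48))  # two 24 GB cards -> ('Q5', 44.0)
```

A single 24 GB card lands on Q2; doubling the VRAM immediately unlocks the much higher-quality Q5.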

Realistic Performance Expectations

Running 70B on 24GB VRAM is slow:

| Quantization | Speed | Per-Token Latency | Use Case |
|---|---|---|---|
| Q2 (fits in 24 GB) | 5–8 tok/sec | ~0.13–0.2 sec | Batch processing only |
| Q3 + offload (24 GB) | 3–5 tok/sec | ~0.2–0.33 sec | Extremely limited |
| Q4 + offload (24 GB) | 1–3 tok/sec | ~0.33–1 sec | Overnight batch only |
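At these speeds, batch jobs are measured in hours. A small helper (purely illustrative; the prompt counts and token lengths are made-up inputs) for estimating wall-clock time before kicking off a run:

```python
def batch_eta_hours(num_prompts: int, avg_output_tokens: int,
                    tok_per_sec: float) -> float:
    """Wall-clock estimate for an offline batch run. Ignores prompt
    processing, which is faster than generation but not free."""
    return num_prompts * avg_output_tokens / tok_per_sec / 3600

# 500 prompts x 300 output tokens at Q2's ~5 tok/sec:
print(f"~{batch_eta_hours(500, 300, 5):.1f} hours")  # ~8.3 hours
```

The same job on a cloud API at ~50 tok/sec would finish in under an hour, which frames the trade-off discussed next.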

Better Alternatives to Constrained 70B

Instead of struggling with 70B on limited VRAM, consider:

  • Use a smaller model (Llama 3.1 8B at Q5 ≈ 6 GB, very fast)
  • Buy a second RTX 4090 for layer splitting (2× 24 GB = 48 GB, roughly 15–25 tok/sec single-stream)
  • Use a cloud API (GPT-4o for important tasks, local for experimentation)
  • Wait for more efficient models (smaller, same quality)

Common Mistakes With Constrained 70B

  • Expecting Q2 to be usable for chat. It is not. Quality degradation is too severe for real-time interaction.
  • Not measuring actual speed before committing. Test with a small prompt (10 tokens) and verify speed before running large batch jobs.
  • Assuming offloading is "free". System RAM bandwidth is an order of magnitude lower than GPU VRAM. Offloading makes inference impractical.
  • Not considering alternatives. A smaller model is dramatically faster and often sufficient in quality.

Sources

  • llama.cpp Quantization — github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/quants.py
  • Model Card: Llama 3.1 70B — huggingface.co/meta-llama/Llama-3.1-70B

Compare your local LLM against 25+ cloud models simultaneously with PromptQuorum.

Try PromptQuorum free →

