Key Points
- Llama 3.1 70B at Q4 = 35 GB (too large for 24GB). At Q3 = 26 GB (still too large). At Q2 = 17 GB (fits!).
- Trade-off: Q2 has noticeable quality loss, roughly 70% of FP16 quality.
- Speed: 3–5 tokens/sec once layers are offloaded to system RAM (ultra-slow).
- Better option: Use a smaller model at Q5 (e.g., Llama 3.1 8B), or buy a second GPU for layer splitting.
- As of April 2026, this is a constraint workaround, not a recommended approach.
The Theoretical VRAM Math
Llama 3.1 70B at various quantizations:
| Quantization | Model Size | Fits 24GB? |
|---|---|---|
| FP16 (baseline) | ~140 GB | No |
| Q8 (8-bit) | ~70 GB | No |
| Q5 (5-bit) | ~44 GB | No |
| Q4 (4-bit) | ~35 GB | No (with offloading: maybe) |
| Q3 (3-bit) | ~26 GB | No (barely, 2 GB over) |
| Q2 (2-bit) | ~17.5 GB | Yes |
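The sizes in the table follow from a simple calculation: parameter count times bits per weight. A minimal sketch using exact bit widths (real GGUF files run slightly larger, since some tensors stay at higher precision and the KV cache needs extra room):

```python
# Rough size of a 70B-parameter model at a given quantization level.
PARAMS = 70e9  # parameter count for Llama 3.1 70B

def model_size_gb(bits_per_weight: float) -> float:
    """Approximate model size in GB: params * bits / 8 bits-per-byte."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("Q8", 8), ("Q5", 5), ("Q4", 4), ("Q3", 3), ("Q2", 2)]:
    size = model_size_gb(bits)
    print(f"{name}: ~{size:.1f} GB -> fits in 24 GB: {size <= 24}")
```

Only the Q2 row comes in under the 24 GB budget; Q3 misses it by about 2 GB, which is why partial offloading comes up next.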
Aggressive Quantization: The Primary Tool
To fit 70B in 24GB, you must use Q2 or Q3 quantization.
- Q3: 26 GB (still 2 GB over). Can offload 2 GB to RAM. Slightly better quality than Q2.
- Q2: 17.5 GB (fits!). Roughly 70% of FP16 quality: noticeable degradation, but usable for offline work.
Download the quantized model: `ollama pull llama3.1:70b-q2` (if available) or use conversion tools like llama.cpp.
Offloading to System RAM
If using Q4 (35 GB) on a 24GB GPU, you can offload the remaining 11 GB to system RAM. Speed penalty is severe (roughly 10× slower).
Only practical for batch processing where you can wait hours for results.
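The split is easy to estimate: whatever doesn't fit in VRAM spills to system RAM. A quick sketch (in practice the driver and KV cache claim some VRAM too, so the real spill is a bit larger):

```python
def offload_split(model_gb: float, vram_gb: float = 24.0) -> tuple[float, float]:
    """Return (GB offloaded to system RAM, fraction of the model offloaded)."""
    offloaded = max(0.0, model_gb - vram_gb)
    return offloaded, offloaded / model_gb

# Q4 70B (~35 GB) on a 24 GB card:
off, frac = offload_split(35.0)
print(f"Offloaded: {off:.0f} GB ({frac:.0%} of the model)")  # prints: Offloaded: 11 GB (31% of the model)
```

With roughly a third of the layers running from system RAM, every forward pass is bottlenecked by the slowest memory tier, which is where the ~10× penalty comes from.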
Practical Setup: Running 70B on 24GB
Step-by-step:
1. Use Q2 quantization: `ollama pull llama3.1:70b-q2` (if available, else convert with llama.cpp)
2. Verify VRAM: `nvidia-smi` should show ~18 GB used
3. Run the model: `ollama run llama3.1:70b-q2`
4. Expect 3–5 tokens/sec (very slow)
5. Use only for batch/offline processing, not interactive chat
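The VRAM check can be scripted if you want to verify it programmatically. A sketch using `nvidia-smi`'s query interface (assumes the NVIDIA driver tools are on your PATH):

```python
import subprocess

def mib_to_gib(mib: int) -> float:
    """nvidia-smi reports memory in MiB; convert to GiB."""
    return mib / 1024

def vram_used_gib(gpu: int = 0) -> float:
    """Current VRAM usage in GiB for the given GPU index."""
    out = subprocess.check_output(
        ["nvidia-smi", "-i", str(gpu),
         "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    )
    return mib_to_gib(int(out.strip()))
```

After loading the Q2 model, `vram_used_gib()` should report something near 18 GiB; a much lower number usually means layers silently spilled to system RAM.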
Realistic Performance Expectations
Running 70B on 24GB VRAM is slow:
| Quantization | Speed | Latency | Use Case |
|---|---|---|---|
| Q2 (fits in 24GB) | 5–8 tok/sec | ~0.13–0.2 sec per token | Batch processing only |
| Q3 + offload (24GB) | 3–5 tok/sec | ~0.2–0.33 sec per token | Extremely limited |
| Q4 + offload (24GB) | 1–3 tok/sec | ~0.33–1 sec per token | Overnight batch only |
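At these speeds, wall-clock time for a batch job adds up quickly. A quick estimator (the workload numbers below are illustrative, not from the text):

```python
def batch_hours(num_jobs: int, tokens_per_job: int, tok_per_sec: float) -> float:
    """Wall-clock hours to generate tokens_per_job tokens for each of num_jobs prompts."""
    return num_jobs * tokens_per_job / tok_per_sec / 3600

# e.g., 500 summaries at ~400 generated tokens each, at 2 tok/sec (Q4 + offload):
print(f"~{batch_hours(500, 400, 2):.0f} hours")  # prints: ~28 hours
```

A job that would finish over lunch on a dual-GPU setup becomes an overnight-plus run here, which is why the table limits these configurations to batch use.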
Better Alternatives to Constrained 70B
Instead of struggling with 70B on limited VRAM, consider:
- Use a smaller model (e.g., Llama 3.1 8B at Q5 ≈ 6 GB, very fast)
- Buy a second RTX 4090 for layer splitting (2× 24GB = 48GB, dramatically faster)
- Use a cloud API (GPT-4o for important tasks, local for experimentation)
- Wait for more efficient models (smaller, same quality)
Common Mistakes With Constrained 70B
- Expecting Q2 to be usable for chat. It is not. Quality degradation is too severe for real-time interaction.
- Not measuring actual speed before committing. Test with a small prompt (10 tokens) and verify speed before running large batch jobs.
- Assuming offloading is "free". System RAM bandwidth is an order of magnitude lower than GPU VRAM. Heavy offloading makes inference impractical.
- Not considering alternatives. A 13B model is dramatically faster and often sufficient in quality.
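To avoid the measurement mistake above, time a short generation before launching a batch run. A sketch against Ollama's local HTTP API (assumes a server on the default port; the model tag is illustrative, and `eval_count`/`eval_duration` are the counters Ollama includes in non-streaming responses):

```python
import json
import urllib.request

def tok_per_sec(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's reported counters (duration is in nanoseconds)."""
    return eval_count / (eval_duration_ns / 1e9)

def measure(model: str = "llama3.1:70b-q2", host: str = "http://localhost:11434") -> float:
    """Run one small prompt and report measured tokens/sec."""
    payload = json.dumps({"model": model, "prompt": "Say OK.", "stream": False})
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=payload.encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return tok_per_sec(body["eval_count"], body["eval_duration"])
```

If the measured rate is below a couple of tokens per second, plan for overnight batch runs rather than anything interactive.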
Sources
- llama.cpp Quantization β github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/quants.py
- Model Card: Llama 3.1 70B β huggingface.co/meta-llama/Llama-3.1-70B