## Key Takeaways
- Llama 3.1 70B at Q4 ≈ 35 GB (too large for 24 GB). At Q3 ≈ 26 GB (still too large). At Q2 ≈ 17.5 GB (fits).
- Trade-off: Q2 has noticeable quality loss, retaining roughly 70% of FP16 quality.
- Speed: low single-digit tokens/sec at best, and any layers offloaded to system RAM slow things further.
- Better option: use a smaller model (e.g. Llama 3.1 8B at Q5), or buy a second GPU and split layers across both.
- As of April 2026, this is a constraint workaround, not a recommended approach.
## The Theoretical VRAM Math
Llama 3.1 70B at various quantizations:
| Quantization | Model Size | Fits 24 GB? |
|---|---|---|
| FP16 (baseline) | ~140 GB | No |
| Q8 (8-bit) | ~70 GB | No |
| Q5 (5-bit) | ~44 GB | No |
| Q4 (4-bit) | ~35 GB | No (maybe with offloading) |
| Q3 (3-bit) | ~26 GB | No (barely over) |
| Q2 (2-bit) | ~17.5 GB | Yes |
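The table values follow from simple arithmetic: model size in GB ≈ billions of parameters × bits per weight ÷ 8 (this ignores the KV cache and activations, which add a few GB at runtime). A quick sketch:

```shell
# Rough model-size estimate: billions of params * bits per weight / 8 = GB.
# Real quant formats mix bit widths, so effective bits/weight is approximate.
estimate_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gb 70 4   # Q4: 35.0 GB
estimate_gb 70 3   # Q3: ~26 GB
estimate_gb 70 2   # Q2: 17.5 GB
```

The same formula explains why 24 GB caps you at roughly 2.7 bits per weight for a 70B model before any offloading.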
## Aggressive Quantization: The Primary Tool
To fit 70B in 24GB, you must use Q2 or Q3 quantization.
- Q3: ~26 GB (still 2 GB over budget); the overflow can be offloaded to system RAM. Slightly better quality than Q2.
- Q2: ~17.5 GB (fits entirely in VRAM). Roughly 70% of FP16 quality: noticeable degradation, but usable for some tasks.
Download the quantized model with `ollama pull llama3.1:70b-q2` (if that tag exists in the registry), or produce your own GGUF with llama.cpp's conversion tools.
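If no prebuilt Q2 tag is available, llama.cpp can produce one from the FP16 weights. A hedged sketch, assuming llama.cpp is built locally and the FP16 model has already been downloaded (paths and filenames are illustrative; `convert_hf_to_gguf.py` and `llama-quantize` ship with llama.cpp):

```shell
# Convert Hugging Face weights to a GGUF file, then requantize it to Q2_K.
# The FP16 intermediate is ~140 GB, so plan disk space accordingly.
python convert_hf_to_gguf.py ./Llama-3.1-70B --outfile llama-3.1-70b-f16.gguf
./llama-quantize llama-3.1-70b-f16.gguf llama-3.1-70b-Q2_K.gguf Q2_K
```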
## Offloading to System RAM
If you run Q4 (~35 GB) on a 24 GB GPU, the remaining ~11 GB must be offloaded to system RAM. The speed penalty is severe, roughly 10× slower.
Only practical for batch processing where you can wait hours for results.
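With llama.cpp directly, the GPU/RAM split is controlled by `--n-gpu-layers` (`-ngl`): that many layers go to VRAM, the rest run from system RAM. A sketch, assuming a Q4 GGUF file is on disk (the layer count that fits in 24 GB is an estimate, not a measured value):

```shell
# Llama 3.1 70B has 80 transformer layers. Put as many as fit in 24 GB on the
# GPU (roughly 50-55 at Q4) and let the remainder run from system RAM.
./llama-cli -m llama-3.1-70b-Q4_K_M.gguf -ngl 55 -p "Summarize the following text: ..."
```

Lowering `-ngl` frees VRAM at the cost of speed; raising it until `nvidia-smi` shows near-full VRAM usage finds the best split empirically.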
## Practical Setup: Running 70B on 24GB
Step-by-step:
1. Use Q2 quantization: `ollama pull llama3.1:70b-q2` (if available, else convert with llama.cpp)
2. Verify VRAM: `nvidia-smi` should show ~18 GB used
3. Run the model: `ollama run llama3.1:70b-q2`
4. Expect roughly 3–5 tokens/sec (very slow)
5. Use only for batch/offline processing, not interactive chat
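The batch/offline workflow in the last step can be sketched as a small driver script (the model tag, prompts file, and output naming are assumptions):

```shell
#!/bin/sh
# Run each line of prompts.txt through the model and save one output file per
# prompt. At 3-5 tok/sec this takes minutes per prompt: an overnight job.
i=0
while IFS= read -r prompt; do
  i=$((i + 1))
  ollama run llama3.1:70b-q2 "$prompt" > "out_${i}.txt"
done < prompts.txt
```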
## Realistic Performance Expectations
Running 70B on 24GB VRAM is slow:
| Quantization | Speed | Time per Token | Use Case |
|---|---|---|---|
| Q2 (fits in 24 GB VRAM) | 5–8 tok/sec | ~0.1–0.2 sec | Batch processing only |
| Q3 + offload (24 GB) | 3–5 tok/sec | ~0.2–0.3 sec | Extremely limited |
| Q4 + offload (24 GB) | 1–3 tok/sec | ~0.3–1 sec | Overnight batch only |
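Per-token latency is just the reciprocal of throughput, which is a quick way to sanity-check any benchmark number you see quoted:

```shell
# seconds per token = 1 / (tokens per second)
sec_per_tok() {
  awk -v t="$1" 'BEGIN { printf "%.2f\n", 1 / t }'
}

sec_per_tok 5   # 5 tok/sec -> 0.20 sec/token
sec_per_tok 1   # 1 tok/sec -> 1.00 sec/token
```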
## Better Alternatives to Constrained 70B
Instead of struggling with 70B on limited VRAM, consider:
- Use a smaller model (e.g. Llama 3.1 8B at Q5 ≈ 6 GB, very fast)
- Buy a second RTX 4090 for layer splitting (2× 24 GB = 48 GB, enough to hold 70B at Q4 entirely in VRAM at interactive speeds)
- Use a cloud API (GPT-4o for important tasks, local for experimentation)
- Wait for more efficient models (smaller, same quality)
## Common Mistakes With Constrained 70B
- Expecting Q2 to be usable for chat. It is not. Quality degradation is too severe for real-time interaction.
- Not measuring actual speed before committing. Test with a small prompt (10 tokens) and verify speed before running large batch jobs.
- Assuming offloading is "free". System RAM bandwidth is roughly an order of magnitude below GPU VRAM, so heavy offloading makes inference impractical.
- Not considering alternatives. A 13B model is dramatically faster and often sufficient in quality.
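For the speed check in the second point, `ollama run --verbose` prints token-rate timings after each response, so the measurement is a one-liner (the model tag is an assumption):

```shell
# --verbose prints prompt/eval timings after the response; keep the eval rate,
# which is the generation throughput in tokens/sec.
ollama run llama3.1:70b-q2 --verbose "Say hello in five words." 2>&1 | grep "eval rate"
```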
## Sources
- llama.cpp Quantization — github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/quants.py
- Model Card: Llama 3.1 70B — huggingface.co/meta-llama/Llama-3.1-70B