Key Points
- Disable logging/debugging (easy): ~10% speed gain.
- Use Q4 quantization (easy): Same speed, smaller VRAM.
- Optimize batch size (medium): 2–3× speed for batch processing.
- Use vLLM instead of Ollama (hard): 2–5× speed for concurrent requests.
- GPU memory utilization 90%+ (medium): 15–20% speed gain.
- Combining all techniques: ~2–3× total speedup.
Batch Size: The Multiplier for Throughput
For batch processing (multiple prompts), increasing batch size dramatically improves throughput.
Single request = limited pipeline utilization. Batch 32 requests = 2–4× throughput.
Trade-off: Higher latency per individual request (they wait for batch to complete).
| Batch Size | Aggregate Throughput | Per-Request Latency | Use Case |
|---|---|---|---|
| 1 (single) | 50 tokens/sec | Minimum | Real-time chat |
| 8 | 120 tokens/sec | Acceptable | Light concurrency |
| 32 | 200 tokens/sec | High | Batch API |
| 64+ | 250+ tokens/sec | Very high | Offline batch |
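The trade-off can be sketched numerically. This is a back-of-the-envelope model using the illustrative figures from the table above (the 256-token response length is an assumption, not from the source):

```python
# Sketch of the throughput vs. latency trade-off, using the
# illustrative numbers from the table above (not measurements).

def batch_stats(batch_size: int, throughput_tps: float,
                tokens_per_request: int = 256):
    """Per-request wall time and per-request token rate for one batch."""
    # All requests in a batch finish together, so each request waits for
    # the whole batch: (batch_size * tokens) / aggregate throughput.
    per_request_latency = batch_size * tokens_per_request / throughput_tps
    per_request_rate = throughput_tps / batch_size
    return per_request_latency, per_request_rate

# Single request at 50 tok/s vs. a batch of 32 at 200 tok/s:
single_latency, _ = batch_stats(1, 50.0)    # 256 / 50       = 5.12 s
batch_latency, _ = batch_stats(32, 200.0)   # 32 * 256 / 200 = 40.96 s
# 4x the aggregate throughput, but each request waits ~8x longer.
```

This is why batch sizes of 32+ belong in offline/batch pipelines, not interactive chat.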
Inference Engine Selection and Tuning
vLLM: typically 2–5× faster than Ollama under concurrent load, and up to 5–10× for large offline batches.
llama.cpp: Fastest for single requests on consumer hardware.
Text-Generation-WebUI: Slower, but more features for experimentation.
For production APIs, vLLM is optimal.
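As one possible starting point, vLLM's OpenAI-compatible server exposes the two knobs discussed in this article as flags. The model name below is a placeholder, and flag defaults change between vLLM releases, so check the docs for your version:

```shell
# Illustrative vLLM server launch for a production-style API.
# --gpu-memory-utilization 0.90 : cap at 90% to leave OOM headroom
# --max-num-seqs 64             : max concurrent sequences per batch
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 64
```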
Quantization Impact on Speed
Q4 and Q5 are approximately the same speed as FP16 on modern GPUs. Older GPUs may benefit from quantization speed-ups.
Benefits of quantization for speed:
- Smaller model file = faster loading
- Reduced memory bandwidth = slightly faster (10–15%) on some hardware
Quantization is primarily for VRAM reduction, not speed.
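The VRAM effect is easy to estimate. A rough sketch for a 7B model (weights only; KV cache and activations add more, and the ~4.5 bits/weight average for Q4 formats is an approximation):

```python
# Back-of-the-envelope VRAM for model weights only.
PARAMS = 7e9  # 7B-parameter model (assumption for this example)

def weight_vram_gb(params: float, bits_per_weight: float) -> float:
    """Decimal GB needed to hold the weights at a given precision."""
    return params * bits_per_weight / 8 / 1e9

fp16 = weight_vram_gb(PARAMS, 16)   # 14.0 GB
q4 = weight_vram_gb(PARAMS, 4.5)    # ~3.9 GB (Q4 averages ~4.5 bits/weight)
```

Same forward-pass speed on a modern GPU, but roughly a quarter of the weight memory.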
Realistic Speed Improvements With Tuning
Example: Optimizing a 7B model on RTX 4090:
| Change | Speed | Cumulative Gain |
|---|---|---|
| Baseline (default settings) | 120 tokens/sec | — |
| Disable logging/debugging | 132 tokens/sec | +10% |
| GPU memory utilization 90%+ | 150 tokens/sec | +25% |
| Batch size 32 | 300 tokens/sec (batch) | +150% |
| Q4 quantization | 300 tokens/sec | +150% (smaller VRAM) |
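The end-to-end gain follows from multiplying the per-technique factors listed in the key points (~10% from disabling logging, ~15% from higher GPU memory utilization, ~2× from batching). The midpoints chosen below are illustrative, not benchmarks:

```python
# Cumulative speedup from stacking the individual techniques.
baseline = 120.0       # tokens/sec before tuning
logging_gain = 1.10    # disable logging/debugging (~10%)
mem_gain = 1.15        # GPU memory utilization 90%+ (midpoint of 15-20%)
batch_gain = 2.0       # batch size 32 (conservative end of 2-4x)

total = logging_gain * mem_gain * batch_gain   # ~2.53x
tuned = baseline * total                       # ~304 tokens/sec
```

Quantization drops out of the product because it is speed-neutral on modern GPUs; it contributes VRAM headroom, which is what makes the larger batch possible.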
Common Speed Optimization Mistakes
- Pushing GPU memory to 100%. Risks out-of-memory crashes. Safe max is 90–95%.
- Lowering batch size hoping for faster single requests. Batch size does not affect single-request latency; it only caps throughput under concurrent load.
- Over-quantizing for speed. Q4 is roughly the same speed as FP16. Quantize for VRAM, not speed.
- Changing inference engine mid-deployment. Switching Ollama → vLLM → llama.cpp introduces bugs. Pick one, optimize it.
Sources
- vLLM Optimization Guide — docs.vllm.ai/en/dev_guide/performance_tuning.html
- Ollama Performance Tips — github.com/ollama/ollama/blob/main/docs/troubleshooting.md