
How to Double Local LLM Speed: Optimization Techniques

10 min read · By Hans Kuepper, founder of PromptQuorum, a multi-model AI dispatch tool

Local LLMs can run 2–3× faster with proper optimization. Techniques include disabling logging, tuning batch size, optimizing quantization, using faster inference engines, and GPU memory tuning. As of April 2026, combining these techniques can achieve a 2–3× speed improvement with no quality loss.

Key Takeaways

  • Disable logging/debugging (easy): ~10% speed gain.
  • Use Q4 quantization (easy): same speed, smaller VRAM.
  • Optimize batch size (medium): 2–3× throughput for batch processing.
  • Use vLLM instead of Ollama (hard): 2–5× speed for concurrent requests.
  • GPU memory utilization 90%+ (medium): 15–20% speed gain.
  • Combining all techniques: ~2–3× total speedup.
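As a rough sanity check, compounding the individual gains above lands in the claimed 2–3× range. This assumes the gains are independent and multiplicative, which is an approximation since real bottlenecks overlap:

```python
# Rough compounding of the individual gains listed above.
# Assumes gains are independent and multiplicative -- an approximation.

logging_gain = 1.10   # disable logging/debugging: ~10%
gpu_mem_gain = 1.175  # GPU memory utilization 90%+: 15-20% (midpoint)
batch_gain = 2.0      # batch size tuning: 2-3x throughput (lower bound)

combined = logging_gain * gpu_mem_gain * batch_gain
print(f"combined speedup: {combined:.2f}x")  # ~2.6x
```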

GPU Memory Utilization: The Hidden Speed Dial

By default, most tools use 70–80% of GPU VRAM. Increasing to 90%+ improves speed by 15–20%:

```bash
# vLLM: increase GPU memory utilization
vllm serve meta-llama/Llama-2-7b-hf \
  --gpu-memory-utilization 0.95

# Ollama: no fractional utilization flag; it reserves headroom instead.
# OLLAMA_GPU_OVERHEAD (set for the server process) is extra VRAM to
# hold back, in bytes -- lower reserve means more VRAM for the model.
export OLLAMA_GPU_OVERHEAD=0
ollama run llama3.2:3b

# LM Studio: Settings → GPU acceleration slider (move to 100%)
```
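To see why 0.95 rather than 1.0, it helps to work out the absolute headroom each setting leaves. The 24 GB card size below is an example (e.g. an RTX 4090), not a measurement:

```python
# VRAM headroom left by different utilization settings on a 24 GB card.
# At 100%, any allocation spike (KV cache growth, CUDA graphs) can OOM.

vram_gb = 24.0

for utilization in (0.80, 0.90, 0.95, 1.00):
    headroom_gb = vram_gb * (1 - utilization)
    print(f"{utilization:.0%} utilization -> {headroom_gb:.1f} GB headroom")
```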

Batch Size: The Multiplier for Throughput

For batch processing (multiple prompts), increasing batch size dramatically improves throughput.

Single request = limited pipeline utilization. Batching 32 requests = 2–4× throughput.

Trade-off: higher latency per individual request (each request waits for the whole batch to complete).

| Batch Size | Throughput | Latency/Request | Use Case |
|---|---|---|---|
| 1 (single) | 50 tokens/sec | Minimum | Real-time chat |
| 8 | 120 tokens/sec | Acceptable | Light concurrency |
| 32 | 200 tokens/sec | High | Batch API |
| 64+ | 250+ tokens/sec | Very high | Offline batch |

Inference Engine Selection and Tuning

vLLM: 5–10× faster than Ollama for batch processing (concurrent requests).

llama.cpp: Fastest for single requests on consumer hardware.

Text-Generation-WebUI: Slower, but more features for experimentation.

For production APIs, vLLM is optimal.
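Whichever engine you pick, measure tokens/sec the same way so comparisons are fair. A minimal, engine-agnostic timing sketch; the `generate` callable is a placeholder for your own client code (e.g. a request to vLLM's or Ollama's HTTP API), and `fake_generate` is a stand-in for illustration:

```python
import time

def tokens_per_second(generate, prompt):
    """Time one generation and return (text, tokens/sec).

    `generate` is any callable returning (text, completion_token_count);
    plug in a client for vLLM, Ollama, or llama.cpp here.
    """
    start = time.perf_counter()
    text, n_tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return text, n_tokens / elapsed

# Stand-in generator so the helper runs without a server:
def fake_generate(prompt):
    time.sleep(0.1)           # pretend inference took 100 ms
    return "hello world", 20  # 20 completion tokens

text, tps = tokens_per_second(fake_generate, "Say hello")
print(f"{tps:.0f} tokens/sec")  # ~200 with the stand-in
```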

Quantization Impact on Speed

Q4 and Q5 are approximately the same speed as FP16 on modern GPUs. Older GPUs may benefit from quantization speed-ups.

Benefits of quantization for speed:

- Smaller model file = faster loading

- Reduced memory bandwidth = slightly faster (10–15%) on some hardware

Quantization is primarily for VRAM reduction, not speed.
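The VRAM claim is easy to verify with back-of-the-envelope math. This counts weights only; the KV cache and activations add more on top:

```python
# Approximate weight memory for a 7B-parameter model at different
# precisions. Weights only -- KV cache and activations are extra.

params = 7e9
bytes_per_param = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}

for name, b in bytes_per_param.items():
    gb = params * b / 1e9
    print(f"{name}: ~{gb:.1f} GB")
```

Q4 cuts a 7B model from ~14 GB to ~3.5 GB of weights, which is what lets it fit on consumer GPUs; the speed, as noted above, barely changes.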

Realistic Speed Improvements With Tuning

Example: Optimizing a 7B model on RTX 4090:

| Change | Speed | Cumulative Gain |
|---|---|---|
| Baseline (defaults) | 120 tokens/sec | — |
| Disable logging/debugging | 132 tokens/sec | +10% |
| GPU memory utilization 0.95 | 150 tokens/sec | +25% |
| Batch size 32 | 300 tokens/sec (batch) | +150% |
| All techniques combined | 300 tokens/sec | 2.5× |

Common Speed Optimization Mistakes

  • Pushing GPU memory to 100%. Risks out-of-memory crashes. Safe max is 90–95%.
  • Tuning batch size for single-request speed. Batch size does not affect single-request latency; it only helps throughput.
  • Over-quantizing for speed. Q4 is roughly the same speed as FP16. Quantize for VRAM, not speed.
  • Changing inference engine mid-deployment. Switching Ollama → vLLM → llama.cpp introduces bugs. Pick one, optimize it.

Sources

  • vLLM Optimization Guide — docs.vllm.ai/en/dev_guide/performance_tuning.html
  • Ollama Performance Tips — github.com/ollama/ollama/blob/main/docs/troubleshooting.md

Compare your local LLM against 25+ cloud models simultaneously with PromptQuorum.

Try PromptQuorum free →

