Key Takeaways
- llama.cpp: Fastest single-token latency (lowest ms/token). Best for interactive chat. Minimal dependencies.
- Ollama: Easiest to use. One command, auto-download models. Trade-off: 5-10% slower throughput than llama.cpp.
- vLLM: Highest throughput (tokens/sec) on batched requests. Best for production API servers. Steeper learning curve.
- Single-user chat: llama.cpp or Ollama (nearly identical speed).
- Multi-user API: vLLM (3-5Γ higher throughput).
- Casual use: Ollama (simplicity wins).
- All three produce identical model outputs β speed/throughput differ.
- Can run all three simultaneously on same machine (different ports). They don't conflict.
Speed Comparison Benchmarks β RTX 4090 24 GB
llama.cpp leads with 38 tok/s single-token; vLLM dominates at 250+ tok/s batched. Benchmarked on RTX 4090 24 GB, Llama 3.3 70B Q4_K_M, single request, April 2026:
| Backend | Tokens/sec | ms/token | VRAM Used | Batch Throughput |
|---|---|---|---|---|
| llama.cpp | 38 | 26 | 39 GB | N/A (no batching) |
| Ollama | 36 | 28 | 39 GB | N/A (single-batch) |
| vLLM | 34 | 29 | 41 GB | 250+ tok/s (continuous) |
Speed Comparison β RTX 3060 12 GB
Benchmarked on RTX 3060 12 GB, Llama 3.2 8B Q4_K_M, single request, April 2026:
| Backend | Tokens/sec | ms/token | VRAM Used | Batch Throughput |
|---|---|---|---|---|
| llama.cpp | 52 | 19 | 5.2 GB | N/A |
| Ollama | 48 | 21 | 5.4 GB | N/A |
| vLLM | 45 | 22 | 6.1 GB | 180 tok/s (batch=8) |
Feature Comparison Table
llama.cpp: best quantization & raw speed. Ollama: simplest installation. vLLM: best batching for production.
| Feature | llama.cpp | Ollama | vLLM |
|---|---|---|---|
| Setup time | 30 min (compile) | 5 min (one command) | 15 min (pip install) |
| OpenAI-compatible API | β (llama-server) | β (native) | β (native) |
| Model format | GGUF | GGUF | SafeTensors / HF |
| GPU support | CUDA, ROCm, Metal | CUDA, ROCm, Metal | CUDA only |
| Batching | β | β | β continuous |
| Multi-GPU | β | β | β tensor parallel |
| Apple Silicon | β Metal | β Metal | β |
| Chat UI | β (server only) | β (needs Open WebUI) | β (API only) |
| License | MIT | MIT | Apache 2.0 |
Batching & Throughput
vLLM processes 32+ requests in parallel; llama.cpp and Ollama handle one at a time. This is where vLLM dominates:
- llama.cpp: No native batching. One request at a time. Latency: 27ms/token. Throughput: 36 tok/s.
- Ollama: Single-batch only. Cannot process 2+ requests in parallel. Same throughput as llama.cpp.
- vLLM: Native continuous batching (dynamically handles concurrent requests). Processes 32 requests concurrently. Throughput: 250+ tok/s on same RTX 4090.
- vLLM's advantage multiplies with concurrent users. For API servers with 10+ users: vLLM is mandatory.
Setup Complexity
Ollama is simplest (5 min); vLLM requires Python (15 min); llama.cpp requires compilation (30 min). Here's the breakdown:
llama.cpp: Compile from source or download binary. Manual model file management. 30 min setup.
Ollama: `brew install ollama` or download installer. `ollama run llama3.2`. 5 min setup.
vLLM: `pip install vllm`, then `python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.3-8B-Instruct`. 15 min setup (Python + dependencies).
Winner for simplicity: Ollama.
API Compatibility
All three now support OpenAI-compatible APIs; Ollama and vLLM are easiest.
llama.cpp: OpenAI-compatible API (via `llama-server`, added late 2024). Works with IDE extensions.
Ollama: OpenAI-compatible API (via `ollama serve` + client library). Works with most IDE extensions.
vLLM: OpenAI-compatible API (native `/v1/chat/completions`). Best compatibility.
For IDE integration (VS Code, Cursor): Ollama or vLLM. Skip llama.cpp.
When to Use Each?
llama.cpp: Minimal dependencies, raw speed. Use if building custom inference engine. Best for Mac (Metal acceleration).
Ollama: Everything-included simplicity. Use for chat UI + personal use. Works on Mac, Linux, Windows.
vLLM: Production API server. Use for multi-user deployments, high throughput requirements. Requires NVIDIA CUDA β does not run on Apple Silicon (M1/M2/M3/M4).
Common Mistakes When Choosing an Inference Backend
- Mistake: Assuming llama.cpp is always fastest. This is only true for single-token latency. vLLM wins on throughput for batch requests (7Γ faster with 10+ concurrent users).
- Mistake: Dismissing Ollama as slow. Ollama is only 5β10% slower than raw llama.cpp β a negligible difference for interactive chat where 34 tok/s feels instant.
- Mistake: Thinking you must pick one backend. You can run all three simultaneously on different ports. Use Ollama for personal chat, vLLM for your API server.
- Mistake: Using vLLM for single-user chat. vLLM's advantage is batching. For single-user interactive chat, Ollama's simpler setup wins.
Regional Context & Data Residency
EU/GDPR: All three backends run fully on-premises. No data leaves your infrastructure, satisfying GDPR Article 28 (no data processor agreement needed). Recommended for EU financial, healthcare, and legal workloads.
Japan/APPI: On-premises inference satisfies APPI requirements for sensitive personal data. vLLM is used in Japanese enterprise deployments for batch document processing.
China/Data Security Law (2021): Local inference avoids cross-border data transfer restrictions. llama.cpp and Ollama are commonly used in China with Qwen2.5 models.
FAQ
Which should I use as a beginner?
Ollama. One command, automatic model downloads, clean interface.
Which is fastest?
For single request: llama.cpp (~3% faster than Ollama). For 10 concurrent requests: vLLM (~7Γ faster).
Can I use llama.cpp instead of Ollama?
Yes, but more setup. Speed gain is negligible (3-5%) for most users.
Is vLLM production-ready?
Yes. Used in real deployments. Steeper learning curve, but worth it for high throughput.
Can I switch backends without retraining?
llama.cpp and Ollama use GGUF format (interchangeable). vLLM uses SafeTensors and requires model conversion.
Which backend is most stable?
Ollama (simple, fewer bugs). llama.cpp is stable too. vLLM updates frequently (more features, occasional breaking changes).
Does vLLM work on Mac?
No. vLLM requires NVIDIA CUDA. For Mac, use llama.cpp or Ollama with Metal acceleration.