Key Points
- Tier 1 (simple): `ollama run mistral` + OpenWebUI. No code required.
- Tier 2 (standard): vLLM + FastAPI wrapper. Python 3.10+, pip install 2 packages, 30 min setup.
- Tier 3 (production): vLLM + nginx load balancer + monitoring (Prometheus). Multi-GPU, multi-user, fault-tolerant.
- IDE integration: VS Code (Continue extension) or Cursor pointed at the vLLM OpenAI-compatible endpoint.
- Batch processing: Send 10 prompts at once, get 10 responses in parallel (not sequential).
- Cost: $0 in software (open source; you pay hardware and electricity) vs. $20/mo (Claude Pro) or $200+/mo for team cloud plans.
- Speed: Tier 2 achieves 30–50 tok/s for coding. Tier 3 achieves 200+ tok/s across users.
- Complexity: Tier 1 (1/10), Tier 2 (4/10), Tier 3 (8/10).
The Three Tiers
Choose based on use case:
- Tier 1: Solo dev, casual chat, no API server. Ollama + chat UI.
- Tier 2: Single developer, IDE integration, custom scripts. vLLM + FastAPI.
- Tier 3: Team deployment, 5+ developers, always-on service. vLLM + nginx + monitoring.
Tier 1: CLI Quick Start (5 minutes)
For coding: install the VS Code extension "Continue" (`continue.dev`), point it at the Ollama API, and get completions in real time.
1. `brew install ollama` (macOS) or download the Windows installer.
2. `ollama run mistral` (downloads the 7B model, then drops you into an interactive chat in the terminal).
3. Optional web UI: point a chat frontend such as OpenWebUI at the Ollama API (`http://localhost:11434`). Ollama itself serves only the API at that address, not a browser UI.
4. Start chatting. Done.
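The chat UI is optional: Ollama's local REST API can be called directly from a script. A minimal sketch using only the standard library, assuming the default port (11434) and the `/api/generate` endpoint with `stream` disabled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt: str, model: str = "mistral") -> bytes:
    """JSON body for Ollama's /api/generate endpoint (stream=False => single response)."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ask_ollama(prompt: str, model: str = "mistral", timeout: float = 120) -> str:
    """POST a prompt to the local Ollama server and return the generated text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]

# Usage (with `ollama run mistral` active in another terminal):
# print(ask_ollama("Explain list comprehensions in one sentence."))
```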
Tier 2: API Server with FastAPI (30 minutes)
Why FastAPI: OpenAI-compatible endpoint. Drop-in replacement for real OpenAI API in your code.
1. Install Python 3.10+ (check with `python --version`).
2. Install vLLM: `pip install vllm` (pulls in a compatible `torch`).
3. Start the vLLM server: `python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-hf --port 8000`.
4. Test the endpoint: `curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "meta-llama/Llama-2-7b-hf", "messages": [{"role": "user", "content": "Write Python code for Fibonacci"}]}'` (the `model` field must match the name the server was started with).
5. Integrate into your IDE: point an OpenAI-compatible extension (e.g. Continue) at `http://localhost:8000/v1`.
6. Batch requests: send multiple prompts concurrently; vLLM's continuous batching processes them together rather than one at a time.
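Because the endpoint is OpenAI-compatible, a plain HTTP client is enough. A hedged sketch (model name and port assume the server command above; the thread pool simply fires requests concurrently so vLLM can batch them server-side):

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

API_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "meta-llama/Llama-2-7b-hf"   # must match the --model the server started with

def chat_payload(prompt: str) -> bytes:
    """JSON body for an OpenAI-style chat completion request."""
    return json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()

def chat(prompt: str, timeout: float = 60) -> str:
    """Send one chat request to the local vLLM server and return the reply text."""
    req = urllib.request.Request(API_URL, data=chat_payload(prompt),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]

def chat_batch(prompts: list[str]) -> list[str]:
    # Fire all requests concurrently; vLLM's continuous batching serves
    # them in parallel instead of sequentially.
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        return list(pool.map(chat, prompts))
```

Swapping `API_URL` for the real OpenAI endpoint (plus an auth header) is the only change needed to move between local and cloud.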
Tier 3: Production Multi-User (2 hours)
Scales to 50+ concurrent developers (~5 tok/s each) on a dual-GPU rig. Cost: electricity only (roughly $100/month if running 24/7).
1. Deploy 2 vLLM instances on separate GPUs (GPU 0, GPU 1).
2. Configure nginx to load-balance requests across both instances.
3. Set up Prometheus for metrics collection (request latency, tokens/sec, errors).
4. Add rate limiting per user (token bucket algorithm).
5. Deploy on a cloud VM or on-prem server with a 10 Gbps network.
6. Monitor via a Grafana dashboard (optional).
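The token bucket in step 4 can be sketched in a few lines. This is a minimal in-process version (the class name and parameters are illustrative, not from any library); a real deployment would keep one bucket per user in front of the nginx upstream or inside the API wrapper:

```python
import time

class TokenBucket:
    """Per-user token bucket: refills `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start full
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; return False to reject the request."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# One bucket per user id, e.g. keyed by API key:
buckets: dict[str, TokenBucket] = {}

def check_rate_limit(user: str, rate: float = 2.0, capacity: float = 10.0) -> bool:
    bucket = buckets.setdefault(user, TokenBucket(rate, capacity))
    return bucket.allow()
```

The capacity sets the allowed burst size; the rate sets the sustained requests/sec per user.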
IDE Integration (VS Code, Cursor)
Setup for real-time code completions:
Alternative (native IDE support): Cursor Editor has built-in local LLM support (no extension needed).
1. Install the "Continue" extension (`continue.dev`).
2. Open the extension settings and configure a custom API endpoint: `http://localhost:8000/v1` (the vLLM endpoint).
3. Set the model name to match the vLLM server (`meta-llama/Llama-2-7b-hf`).
4. Press Ctrl+Shift+Space (Cmd+Shift+Space on macOS) to trigger a completion.
5. Completions stream in real time (10–20 tok/s).
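In Continue the endpoint and model from steps 2–3 live in its `config.json`. A sketch of the relevant entry, assuming Continue's `models` array with `provider`/`apiBase` fields (field names should be checked against the current `continue.dev` docs, as the config format has changed between versions):

```json
{
  "models": [
    {
      "title": "Local vLLM",
      "provider": "openai",
      "model": "meta-llama/Llama-2-7b-hf",
      "apiBase": "http://localhost:8000/v1"
    }
  ]
}
```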
Debugging & Monitoring
- vLLM logs: Check stdout for errors (model loading, OOM, CUDA errors).
- Prometheus metrics: vLLM exports `/metrics` endpoint (request count, latency histogram, tokens generated).
- Token counting: count tokens before sending to avoid OOM surprises. Use the model's own tokenizer (e.g. Hugging Face `AutoTokenizer`) for accurate counts; `tiktoken` implements OpenAI's tokenizers, so for Llama-family models it is only a rough estimate.
- Latency profiling: Add timestamp logging before/after vLLM call to identify bottlenecks.
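The latency logging above takes a few lines as a decorator. A minimal sketch (the decorator name and the stand-in function are illustrative) that wraps any LLM call and logs wall-clock time per request:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm-latency")

def timed(fn):
    """Log wall-clock latency around each call to spot bottlenecks."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            log.info("%s took %.3fs", fn.__name__, elapsed)
    return wrapper

@timed
def fake_llm_call(prompt: str) -> str:
    time.sleep(0.05)   # stand-in for the real vLLM request
    return "response"
```

Comparing these client-side timings with vLLM's `/metrics` latency histogram shows whether time is spent in the model or in the network/queue in front of it.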
Common Setup Mistakes
- Running vLLM on same GPU as another process (Discord, gaming). Causes GPU out-of-memory errors.
- Sending requests with no timeout. If vLLM hangs, client hangs forever. Always set `timeout=60` in requests.
- Assuming vLLM auto-scales across multiple GPUs. Requires explicit `--tensor-parallel-size` flag.
- Forgetting to set `CUDA_VISIBLE_DEVICES` when running multiple instances. Each instance defaults to GPU 0, so two instances collide on the same card; pin each one to its own GPU explicitly.
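The multi-GPU pinning above can be sketched as a small launcher (helper names are illustrative; the module path and flags match the vLLM server command used earlier):

```python
import os
import subprocess

def vllm_command(model: str, port: int) -> list[str]:
    """Build the vLLM OpenAI-server command line (one instance per GPU)."""
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--port", str(port),
    ]

def launch_on_gpu(gpu: int, model: str, port: int) -> subprocess.Popen:
    # Pin this instance to a single GPU so two instances don't
    # both grab GPU 0 and OOM each other.
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}
    return subprocess.Popen(vllm_command(model, port), env=env)

# Tier 3, step 1: one instance per GPU behind nginx.
# launch_on_gpu(0, "meta-llama/Llama-2-7b-hf", 8000)
# launch_on_gpu(1, "meta-llama/Llama-2-7b-hf", 8001)
```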
FAQ
Which tier should I use?
Tier 1 if solo (casual use). Tier 2 if single dev + IDE integration. Tier 3 if team + 24/7 service.
Can I use vLLM instead of Ollama?
Yes, but more setup. vLLM is faster (batching) and more flexible (Python API).
How do I serve models across multiple GPUs?
vLLM: `--tensor-parallel-size 2`. Splits the model's weights across 2 GPUs, which frees VRAM for larger models or batches and can roughly double throughput.
Can I fine-tune on top of vLLM inference?
No. Fine-tune separately (HuggingFace Transformers), then load fine-tuned model in vLLM.
What if vLLM OOMs?
Use a more aggressive quantization (e.g. 4-bit AWQ/GPTQ instead of 8-bit), reduce `--max-model-len`, or lower `--gpu-memory-utilization`. Check `nvidia-smi` for what else is occupying VRAM.
Is Tier 3 production-ready?
Yes, with monitoring. Add Prometheus, Grafana, alerting (Alertmanager). Standard infrastructure patterns.
Sources
- vLLM official documentation and OpenAI API compatibility guide
- FastAPI official documentation
- Prometheus metrics documentation for vLLM scrape config
- Continue.dev extension documentation