Tools & Interfaces

Best Local LLM Stack for Developers

10 min · By Hans Kuepper, founder of PromptQuorum, a multi-model AI dispatch tool

Developers should use vLLM + FastAPI + VS Code Copilot extension for production-grade local LLM inference. As of April 2026, this stack enables real-time code completions, batch processing, and OpenAI API compatibility without vendor lock-in. Alternative (simpler): Ollama + llama.cpp CLI for one-off scripts.

Key Takeaways

  • Tier 1 (simple): `ollama run mistral` + OpenWebUI. No code required.
  • Tier 2 (standard): vLLM + FastAPI wrapper. Python 3.10+, pip install 2 packages, 30 min setup.
  • Tier 3 (production): vLLM + nginx load balancer + monitoring (Prometheus). Multi-GPU, multi-user, fault-tolerant.
  • IDE integration: VS Code Copilot or Cursor with vLLM OpenAI API endpoint.
  • Batch processing: Send 10 prompts at once, get 10 responses in parallel (not sequential).
  • Cost: Zero (open source) vs. $20/mo (Claude Pro) or $200/mo (large team cloud).
  • Speed: Tier 2 achieves 30–50 tok/s for coding. Tier 3 achieves 200+ tok/s across users.
  • Complexity: Tier 1 (1/10), Tier 2 (4/10), Tier 3 (8/10).

The Three Tiers

Choose based on use case:

  • Tier 1: Solo dev, casual chat, no API server. Ollama + chat UI.
  • Tier 2: Single developer, IDE integration, custom scripts. vLLM + FastAPI.
  • Tier 3: Team deployment, 5+ developers, always-on service. vLLM + nginx + monitoring.

Tier 1: CLI Quick Start (5 minutes)

  1. `brew install ollama` (macOS) or download the Windows installer.
  2. `ollama run mistral` (downloads and runs the 7B model).
  3. Start chatting directly in the terminal. For a browser UI, point OpenWebUI at the Ollama API (`http://localhost:11434`); Ollama itself has no built-in web UI.
  4. Done.

For coding: install the VS Code extension "Continue" (`continue.dev`), point it at the Ollama API, and get completions in real time.
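The same Ollama server can be scripted. A minimal sketch using only the Python standard library, assuming the default endpoint (`http://localhost:11434`) and an already-pulled `mistral` model:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama API endpoint

def build_request(prompt: str, model: str = "mistral") -> urllib.request.Request:
    """Build a non-streaming generate request for the Ollama API."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt: str, model: str = "mistral", timeout: int = 60) -> str:
    """Send the prompt and return the model's response text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]

# To actually send (requires a running Ollama server):
#   print(generate("Write a one-line Python lambda that squares a number."))
```

Note the explicit `timeout` on the request: without it, a hung server hangs the script forever.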

Tier 2: API Server with FastAPI (30 minutes)

Why this stack: vLLM's bundled OpenAI-compatible server is itself built on FastAPI, making it a drop-in replacement for the real OpenAI API in your code. Add your own FastAPI wrapper only when you need auth, logging, or custom routes.

  1. Install Python 3.10+ and verify: `python --version`.
  2. Install vLLM: `pip install vllm` (pulls in a compatible `torch`).
  3. Start the vLLM server: `python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-hf --port 8000`.
  4. Test the endpoint: `curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "meta-llama/Llama-2-7b-hf", "messages": [{"role": "user", "content": "Write Python code for Fibonacci"}]}'`.
  5. Integrate into the IDE: point the Continue/Copilot extension at `http://localhost:8000/v1`.
  6. Batch requests: send multiple prompts in parallel; vLLM's continuous batching processes them concurrently rather than sequentially.
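The batching step can be sketched with only the standard library. Assumptions: the vLLM server from step 3 is running on `http://localhost:8000` and serving `meta-llama/Llama-2-7b-hf`; the client just fires concurrent requests and lets vLLM batch them server-side:

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

API_URL = "http://localhost:8000/v1/chat/completions"  # vLLM OpenAI-compatible endpoint
MODEL = "meta-llama/Llama-2-7b-hf"

def chat_request(prompt: str) -> urllib.request.Request:
    """Build one OpenAI-style chat completion request."""
    payload = {"model": MODEL, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt: str, timeout: int = 60) -> str:
    """Send one prompt; always set a timeout so a hung server can't hang the client."""
    with urllib.request.urlopen(chat_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

def complete_batch(prompts: list[str]) -> list[str]:
    """Fire all prompts concurrently; vLLM batches them on the server."""
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        return list(pool.map(complete, prompts))

# To actually send (requires a running vLLM server):
#   for answer in complete_batch(["Reverse a string in Python", "Explain list slicing"]):
#       print(answer)
```

With the `openai` Python package installed, the same endpoint also works by setting `base_url="http://localhost:8000/v1"` on the client.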

Tier 3: Production Multi-User (2 hours)

Scales to 50+ concurrent developers (≈5 tok/s each) on a dual-GPU rig. Cost: electricity only (~$100/month if run 24/7).

  1. Deploy 2 vLLM instances on separate GPUs (GPU 0, GPU 1).
  2. Configure nginx to load-balance requests across both instances.
  3. Set up Prometheus for metrics collection (request latency, tokens/sec, errors).
  4. Add rate limiting per user (token bucket algorithm).
  5. Deploy on a cloud VM or an on-prem server with a 10Gbps network.
  6. Monitor via a Grafana dashboard (optional).
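The per-user rate limiting in step 4 is a classic token bucket. A minimal single-process sketch (a real deployment would enforce limits in nginx or a shared store such as Redis so they hold across instances; the capacity and refill numbers here are illustrative, not from the article):

```python
import time

class TokenBucket:
    """Token bucket: each request costs one token; tokens refill at a fixed rate."""

    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per user: e.g. a burst of 5 requests, refilling at 1 request/sec.
buckets: dict[str, TokenBucket] = {}

def check_rate_limit(user: str) -> bool:
    bucket = buckets.setdefault(user, TokenBucket(capacity=5, refill_per_sec=1.0))
    return bucket.allow()
```

Rejected requests should return HTTP 429 so well-behaved clients back off.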

IDE Integration (VS Code, Cursor)

Setup for real-time code completions:

Alternative (native IDE support): Cursor Editor has built-in local LLM support (no extension needed).

  1. Install the "Continue" extension (`continue.dev`).
  2. Open the extension settings and configure a custom OpenAI-compatible API: `http://localhost:8000/v1` (the vLLM endpoint).
  3. Set the model name to match the vLLM server (`meta-llama/Llama-2-7b-hf`).
  4. Press Ctrl+Shift+Space (Cmd+Shift+Space on macOS) to trigger a completion.
  5. Completions stream in real time (10–20 tok/s).
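The steps above correspond to a Continue `config.json` entry along these lines (field names follow Continue's documented OpenAI-provider format and may differ between Continue versions; the `apiKey` value is a placeholder, since vLLM ignores it unless started with `--api-key`):

```json
{
  "models": [
    {
      "title": "Local Llama (vLLM)",
      "provider": "openai",
      "model": "meta-llama/Llama-2-7b-hf",
      "apiBase": "http://localhost:8000/v1",
      "apiKey": "not-needed"
    }
  ]
}
```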

Debugging & Monitoring

  • vLLM logs: Check stdout for errors (model loading, OOM, CUDA errors).
  • Prometheus metrics: vLLM exports `/metrics` endpoint (request count, latency histogram, tokens generated).
  • Token counting: Use the `tiktoken` library for a rough token count before sending (counts are approximate for non-OpenAI tokenizers such as Llama's) to avoid context-length and OOM surprises.
  • Latency profiling: Add timestamp logging before/after vLLM call to identify bottlenecks.
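The latency-profiling bullet can be a small decorator that wraps any client call. A minimal sketch (the `fake_vllm_call` stand-in is hypothetical; in practice you would wrap your real HTTP call to the vLLM endpoint):

```python
import functools
import time

def timed(fn):
    """Log the wall-clock latency of each call to the wrapped function."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            print(f"{fn.__name__} took {elapsed * 1000:.1f} ms")
    return wrapper

@timed
def fake_vllm_call(prompt: str) -> str:
    # Stand-in for the real HTTP request to the vLLM endpoint.
    time.sleep(0.01)
    return f"response to: {prompt}"
```

Comparing the wrapper's numbers against vLLM's own `/metrics` latency histogram separates network overhead from inference time.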

Common Setup Mistakes

  • Running vLLM on same GPU as another process (Discord, gaming). Causes GPU out-of-memory errors.
  • Sending requests with no timeout. If vLLM hangs, client hangs forever. Always set `timeout=60` in requests.
  • Assuming vLLM auto-scales across multiple GPUs. Requires explicit `--tensor-parallel-size` flag.
  • Forgetting to set CUDA_VISIBLE_DEVICES when running multiple instances. Each vLLM process defaults to GPU 0, so two instances collide unless each is pinned to its own GPU.
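The last two pitfalls in one launch command: pin the process to specific GPUs, then shard the model across them explicitly (flags as documented by vLLM; model name as used above):

```shell
# Pin this process to GPUs 0 and 1, then shard the model across both.
CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-hf \
  --tensor-parallel-size 2 \
  --port 8000
```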

FAQ

Which tier should I use?

Tier 1 if solo (casual use). Tier 2 if single dev + IDE integration. Tier 3 if team + 24/7 service.

Can I use vLLM instead of Ollama?

Yes, but more setup. vLLM is faster (batching) and more flexible (Python API).

How do I serve models across multiple GPUs?

vLLM: `--tensor-parallel-size 2`. Splits the model's weights across 2 GPUs, doubling available VRAM and raising throughput (the speedup is typically somewhat below a full 2× due to inter-GPU communication).

Can I fine-tune on top of vLLM inference?

No. Fine-tune separately (HuggingFace Transformers), then load fine-tuned model in vLLM.

What if vLLM OOMs?

Run a quantized model (e.g. AWQ or GPTQ instead of full FP16), reduce `--max-model-len` or the batch size, or lower `--gpu-memory-utilization`. Check `nvidia-smi` for other processes holding VRAM.
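A launch command combining those mitigations (flags as documented by vLLM; the AWQ checkpoint name is one published example on Hugging Face, substitute your own):

```shell
# Cap vLLM's VRAM share and context length, and load an AWQ-quantized model.
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-7B-AWQ \
  --quantization awq \
  --gpu-memory-utilization 0.85 \
  --max-model-len 2048
```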

Is Tier 3 production-ready?

Yes, with monitoring. Add Prometheus, Grafana, alerting (Alertmanager). Standard infrastructure patterns.

Sources

  • vLLM official documentation and OpenAI API compatibility guide
  • FastAPI official documentation
  • Prometheus metrics documentation for vLLM scrape config
  • Continue.dev extension documentation

Compare local LLMs side by side with 25+ cloud models on PromptQuorum.

Try PromptQuorum for free →

← Back to Local LLMs
