Key Takeaways
- Small team (5–10): Single server (vLLM) + nginx + auth = $3K hardware, $50/mo electricity.
- Medium team (10–50): Dual-GPU cluster + load balancer + Prometheus monitoring = $6K hardware, $100/mo electricity.
- Large team (50+): Enterprise setup with redundancy, caching layer (Redis), auto-scaling = custom quote.
- Cost per user: $10–100/month depending on inference volume (vs. $200–500/month cloud APIs).
- Setup time: Single server = 1 day. Cluster = 1 week. Enterprise = 1 month (including security audit).
- API authentication: OAuth 2.0 (SSO via AD/Okta) for enterprise. Simple token auth for SMB.
- Usage tracking: Every query logged with user ID, timestamp, tokens generated (for cost attribution).
- Admin burden: Minimal (automated monitoring). Scaling event = add GPU card + rebalance (no code changes).
Architecture: Single Server vs Cluster
Single vLLM server (5–10 users):
- 1× RTX 4090 + 64GB RAM + 1TB SSD.
- Handles 10 concurrent users (5 tok/s each).
- Simple setup, single point of failure.
- Cost: $2,500 hardware + $50/mo electricity.
Dual-GPU cluster (10–50 users):
- 2× vLLM instances (one per GPU) + nginx load balancer.
- Handles 20 concurrent users (10 tok/s each).
- Automatic failover (if GPU 0 dies, GPU 1 stays up).
- Cost: $5,000 hardware + $100/mo electricity.
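In the cluster setup, nginx handles failover at the proxy layer; the same idea can be sketched client-side in Python. The backend URLs and the `send` callable below are illustrative placeholders, not part of any real deployment:

```python
# Minimal client-side failover sketch: try each vLLM backend in order and
# return the first successful response. URLs are hypothetical examples.
from typing import Callable, Sequence

BACKENDS = ["http://gpu0:8000/v1/completions",
            "http://gpu1:8000/v1/completions"]

def request_with_failover(send: Callable[[str], str],
                          backends: Sequence[str] = BACKENDS) -> str:
    """`send` posts the request to one backend URL and raises on failure."""
    last_error: Exception | None = None
    for url in backends:
        try:
            return send(url)
        except Exception as exc:  # e.g. connection refused, timeout
            last_error = exc      # remember the error, try the next backend
    raise RuntimeError(f"all backends down: {last_error!r}")
```

This mirrors what nginx's upstream failover does for you: if GPU 0's instance is unreachable, the request lands on GPU 1's instance instead.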
Redis caching layer (optional):
- Cache common prompts (system messages, templates).
- 30% latency reduction for repeated queries.
- Cost: $1K additional hardware.
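A prompt cache can be sketched with a few lines of Python around a Redis client. The `generate` callable standing in for the vLLM request is a placeholder; the key scheme is one reasonable choice, not a standard:

```python
# Sketch of a Redis-backed cache for repeated prompts. Identical model +
# prompt + sampling params map to the same key, so repeats skip inference.
import hashlib
import json

def cache_key(model: str, prompt: str, params: dict) -> str:
    """Deterministic cache key over everything that affects the output."""
    payload = json.dumps({"model": model, "prompt": prompt, "params": params},
                         sort_keys=True)
    return "llmcache:" + hashlib.sha256(payload.encode()).hexdigest()

def cached_generate(redis_client, model, prompt, params, generate, ttl=3600):
    """Return a cached completion if present, else call `generate` and store it."""
    key = cache_key(model, prompt, params)
    hit = redis_client.get(key)
    if hit is not None:
        return hit.decode() if isinstance(hit, bytes) else hit
    result = generate(model, prompt, params)
    redis_client.setex(key, ttl, result)  # expire stale completions after ttl seconds
    return result
```

Note this only helps deterministic requests (temperature 0 or fixed templates); sampled outputs with nonzero temperature should usually bypass the cache.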
User Access & Authentication
Simple auth (SMB < 50 users): API key per user. User sends `Authorization: Bearer $API_KEY` in request header.
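The simple-auth scheme can be sketched in a few lines. The key table and user names here are invented examples; in practice the keys would live in a secrets store, not in source code:

```python
# Sketch of per-user API-key auth: map bearer tokens to user IDs.
# API_KEYS and its contents are illustrative placeholders.
API_KEYS = {"key-alice-123": "alice", "key-bob-456": "bob"}

def authenticate(headers: dict) -> str:
    """Return the user ID for a valid `Authorization: Bearer <key>` header."""
    auth = headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        raise PermissionError("missing bearer token")
    user = API_KEYS.get(auth.removeprefix("Bearer "))
    if user is None:
        raise PermissionError("unknown API key")
    return user
```

The returned user ID is what downstream rate limiting and audit logging key on.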
Enterprise auth: OAuth 2.0 + SAML 2.0 integration with Okta/Azure AD. SSO login, automatic group assignment.
Rate limiting: Per-user token quota (e.g., 100K tokens/day). Prevents one team overusing the server.
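A per-user daily quota can be sketched as a small in-memory tracker; a production version would keep this state in Redis or a database so it survives restarts:

```python
# Sketch of a per-user daily token quota (e.g. 100K tokens/day).
# In-memory state only; an illustration of the policy, not a production limiter.
import datetime

class TokenQuota:
    def __init__(self, daily_limit: int = 100_000):
        self.daily_limit = daily_limit
        self.usage = {}  # (user, date) -> tokens used that day

    def charge(self, user: str, tokens: int, today=None) -> bool:
        """Record `tokens` for `user`; return False if it would exceed the cap."""
        key = (user, today or datetime.date.today())
        used = self.usage.get(key, 0)
        if used + tokens > self.daily_limit:
            return False  # reject the request: quota exhausted for today
        self.usage[key] = used + tokens
        return True
```

Rejected requests can return HTTP 429 with the time the quota resets; the date key means quotas roll over automatically at midnight.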
Audit trail: Log every API call with user ID, IP, request size, response size, timestamp.
Cost Attribution & Metering
Track: Tokens generated per user per day. Sum across team for total cost.
Attribution: Allocate server cost proportionally (e.g., if Alice generates 40% of tokens, she gets 40% of bill).
Showback report: Monthly report per user: tokens used, estimated cloud API cost, internal cost, savings.
Tools: Prometheus + a custom billing service, or a hosted metering service such as Metered.io (cloud-based cost tracking).
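The proportional-attribution rule above is simple arithmetic; a showback report can be generated with a few lines of Python. The server cost and per-1K-token cloud price below are made-up illustrative numbers:

```python
# Sketch of proportional cost attribution for a monthly showback report.
# monthly_cost and cloud_price_per_1k are illustrative inputs, not real prices.
def showback(tokens_by_user: dict, monthly_cost: float,
             cloud_price_per_1k: float = 0.01) -> dict:
    total = sum(tokens_by_user.values()) or 1  # avoid division by zero
    report = {}
    for user, tokens in tokens_by_user.items():
        internal = monthly_cost * tokens / total       # proportional share
        cloud = tokens / 1000 * cloud_price_per_1k     # what a cloud API would bill
        report[user] = {
            "tokens": tokens,
            "internal_cost": round(internal, 2),
            "cloud_cost": round(cloud, 2),
            "savings": round(cloud - internal, 2),
        }
    return report
```

So a user who generates 40% of the month's tokens is attributed 40% of the server cost, and the report shows the gap against the hypothetical cloud bill.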
Team Size Scaling
5–10 users: 1× RTX 4090. The server saturates when everyone runs inference simultaneously; expect brief, acceptable latency spikes.
10–30 users: 2× RTX 4090 (dual-GPU machine). An nginx load balancer spreads the load; 20 concurrent users is comfortable.
30–100 users: 3–4× GPU cluster (separate machines) + a dedicated load balancer (hardware or software). Kubernetes optional.
100+ users: Enterprise architecture (cloud failover, cache layer, API gateway); consider a hybrid setup (local baseline + cloud burst).
Monitoring & Troubleshooting
Prometheus metrics: vLLM exports request latency, tokens/sec, queue length. Scrape every 15 sec.
Grafana dashboard: Visualize queue depth, latency percentiles (p50, p99), GPU utilization.
Alerts: If latency > 2 sec or queue > 10 requests, page on-call engineer.
Logs: Centralize vLLM + nginx logs in ELK Stack. Search by user, timestamp, error.
Bottleneck identification: If GPU saturated (>90% utilization) and latency > 1 sec, add GPU. If CPU saturated, upgrade CPU.
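The bottleneck rule above can be written down as a small decision helper; the thresholds come straight from the text, and the returned action names are made up for illustration:

```python
# Sketch of the bottleneck-identification rule as a decision helper.
# Thresholds match the guidance above; action labels are illustrative.
def scaling_action(gpu_util: float, cpu_util: float, p99_latency_s: float) -> str:
    if gpu_util > 0.90 and p99_latency_s > 1.0:
        return "add-gpu"      # compute-bound: add a card and rebalance
    if cpu_util > 0.90:
        return "upgrade-cpu"  # tokenization/scheduling bound
    if p99_latency_s > 2.0:
        return "investigate"  # latency high but no resource saturated
    return "ok"
```

Fed from Prometheus metrics, a check like this can drive the alerting rules rather than paging on raw latency alone.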
Common Setup Mistakes
- Single point of failure (one GPU, no failover): if the GPU dies, the whole team loses access. Use dual-GPU minimum.
- No rate limiting: one user runs a 1M-token inference and blocks everyone else. Implement per-user token quotas.
- No audit logs: you can't track who accessed what data. Logging is mandatory for compliance teams.
FAQ
Can I add more users without buying new hardware?
Up to 20–30 users per GPU; beyond that, add a GPU. One RTX 4090 sustains roughly 5 tok/s per user at that concurrency.
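The capacity math behind that answer is a one-liner; the aggregate throughput figure depends heavily on model size and quantization, so treat the numbers as illustrative:

```python
# Back-of-the-envelope capacity check: concurrent users a GPU supports at a
# target per-user speed. aggregate_tok_s is an assumed, workload-dependent figure.
def max_concurrent_users(aggregate_tok_s: float, per_user_tok_s: float = 5.0) -> int:
    """A card sustaining ~100 tok/s aggregate supports ~20 users at 5 tok/s each."""
    return int(aggregate_tok_s // per_user_tok_s)
```

If measured aggregate throughput divided by the target per-user speed drops below your concurrent-user count, that is the signal to add a card.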
How do I handle model updates (new Llama 3 variant)?
Download the new model on a separate machine, test it, then swap it in. A single vLLM instance must be restarted to change models, so for zero downtime run a second instance with the new model and switch the load balancer over (a blue-green deployment).
Should I use Kubernetes for team deployment?
Not needed for <50 users. Plain Docker + docker-compose is simpler. Kubernetes adds overhead.
Can I bill users based on tokens?
Yes, via showback reports. But decide policy first (shared cost vs. chargeback per dept).
What if a user accidentally deletes data on the server?
Backups. Run a daily backup of all input/output logs to external storage, and use RAID 6 for disk redundancy.
Can I integrate with Slack/Teams for easy access?
Yes. A Slack bot calls the vLLM API and posts the response. A common pattern is to point an existing OpenAI-API Slack bot at vLLM's OpenAI-compatible endpoint.
Sources
- vLLM official documentation: multi-user setup and rate limiting
- Prometheus documentation: metrics collection and alerting
- Kubernetes best practices (optional): container orchestration for large deployments