
Best Local LLM Stack for Developers (April 2026)

10 min · By Hans Kuepper · Founder of PromptQuorum, multi-model AI dispatch tool

For production-grade local LLM inference, developers should use vLLM + FastAPI + the VS Code Copilot extension. As of April 2026, this stack provides real-time code completion, batch processing, and OpenAI API compatibility without vendor lock-in. Simpler alternative: for one-off scripts, use Ollama + the llama.cpp CLI.

Key Takeaways

  • Tier 1 (simple): `ollama run llama3.2` + Open WebUI. No code required.
  • Tier 2 (standard): vLLM + FastAPI wrapper. Python 3.10+, two packages installed with pip, 30-minute setup.
  • Tier 3 (production): vLLM + nginx load balancer + monitoring (Prometheus). Multi-GPU, multi-user, fault tolerant.
  • IDE integration: use VS Code Copilot or Cursor with the vLLM OpenAI API endpoint.
  • Batch processing: send 10 prompts at once and receive 10 responses in parallel (not sequentially).
  • Cost: free (open source) versus $20/month (Claude Pro) or $200/month (cloud for a large team).
  • Speed: Tier 2 reaches 30-50 tokens/sec for coding. Tier 3 reaches 200+ tokens/sec summed across users.
  • Complexity: Tier 1 (1/10), Tier 2 (4/10), Tier 3 (8/10).

The 3-Tier Setup

Choose based on your use case:

  • Tier 1: individual developer, general chat, no API server needed. Ollama + chat UI.
  • Tier 2: single developer, IDE integration, custom scripts. vLLM + FastAPI.
  • Tier 3: team deployment, 5 or more developers, always-on service. vLLM + nginx + monitoring.

Tier 1: CLI Quick Start (5 minutes)

For coding: install the VS Code extension "Continue" (`continue.dev`) and connect it to the Ollama API to get real-time code completion.

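  1. `brew install ollama` (macOS), or download the Windows installer.
  2. `ollama run llama3.2` (downloads and runs the model; the default `llama3.2` tag is the 3B variant).
  3. Open `http://localhost:11434` in a browser to confirm the Ollama server is running. This address serves the API and a plain status message; the chat UI comes from a front end such as Open WebUI.
  4. Start chatting. Done.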
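
Once the server is running, it can also be called from a script. The following is a minimal sketch, assuming Ollama's default port 11434 and its `/api/generate` route with streaming turned off. The helper names `build_generate_request` and `generate` are hypothetical and exist only for this illustration.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "llama3.2") -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.2", timeout: float = 60.0) -> str:
    """Send the request and return the generated text from the JSON reply."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Calling `generate("Write a one-line Python hello world.")` should return the model's reply as a string, provided the model from step 2 has been pulled.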

Tier 2: FastAPI API Server (30 minutes)

Why FastAPI: it gives you an OpenAI-compatible endpoint, a drop-in replacement for the real OpenAI API in your code.

  1. Confirm Python 3.10+ is installed: `python --version`.
  2. Install vLLM: `pip install vllm torch`.
  3. Start the vLLM server: `python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --port 8000`.
  4. Test the endpoint: `curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Write Python code for Fibonacci"}]}'`. The `model` field must match the name passed to `--model`.
  5. IDE integration: point the Copilot extension at `http://localhost:8000`.
  6. Batch requests: send multiple prompts in parallel; vLLM batches them and processes them together.
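
The endpoint test and the batch step can be sketched from Python using only the standard library. This is an illustration under assumptions rather than a reference client: the base URL and port follow the steps above, `MODEL` must be replaced with whatever was passed to `--model`, and the helper names `chat_payload`, `chat`, and `chat_many` are hypothetical.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000/v1"        # assumption: vLLM on port 8000
MODEL = "meta-llama/Llama-3.1-8B-Instruct"   # assumption: must equal the --model value


def chat_payload(prompt: str, model: str = MODEL, max_tokens: int = 256) -> dict:
    """Build the JSON body for an OpenAI-style chat completion."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str, timeout: float = 60.0) -> str:
    """Send one prompt; the timeout keeps the client from hanging forever."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["choices"][0]["message"]["content"]


def chat_many(prompts, send=chat, workers: int = 10) -> list:
    """Issue all prompts concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, prompts))
```

Because the requests arrive at the server at roughly the same time, vLLM can schedule them together instead of one after another. The `send` parameter exists so the fan-out logic can be exercised without a running server.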

Tier 3: Production Multi-User (2 hours)

A dual-GPU machine can serve 50 or more concurrent developers (about 5 tokens/sec each). Cost: electricity only (roughly $100/month when running 24/7).

  1. Deploy two vLLM instances on separate GPUs (GPU 0, GPU 1).
  2. Configure nginx to distribute requests across both instances.
  3. Set up Prometheus for metrics collection (request latency, tokens/sec, errors).
  4. Add per-user rate limiting (token bucket algorithm).
  5. Deploy on a cloud VM or an on-premises server with 10 Gbps networking.
  6. Monitor with Grafana dashboards (optional).
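
The per-user rate limiting in step 4 can be sketched as a token bucket. This is a minimal in-process illustration, not the article's implementation: the class names `TokenBucket` and `PerUserLimiter`, the parameters `rate` and `capacity`, and the injectable `clock` are all assumptions made for the example.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start with a full bucket
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerUserLimiter:
    """Keep one independent bucket per user id."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, user_id: str, cost: float = 1.0) -> bool:
        if user_id not in self.buckets:
            self.buckets[user_id] = TokenBucket(self.rate, self.capacity, self.clock)
        return self.buckets[user_id].allow(cost)
```

A gateway in front of vLLM could call `limiter.allow(user_id)` before forwarding a request and return HTTP 429 when it is `False`. With several gateway processes, the bucket state would need to live in shared storage instead of a local dictionary.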

IDE Integration (VS Code, Cursor)

To set up real-time code completion:

  1. Install the "Continue" extension (`continue.dev`).
  2. Open the extension settings and configure a custom API base: `http://localhost:8000/v1` (the vLLM endpoint).
  3. Set the model name to exactly the value passed to `--model` when starting vLLM (for example, `meta-llama/Llama-3.1-8B-Instruct`).
  4. Press Ctrl+Shift+Space (or Cmd+Shift+Space) to trigger a completion.
  5. Completions stream in real time (10-20 tokens/sec).

Alternative (native IDE support): the Cursor editor supports local LLMs natively, with no extension required.

Debugging and Monitoring

  • vLLM logs: check stdout for errors (model loading, OOM, CUDA errors).
  • Prometheus metrics: vLLM exposes a `/metrics` endpoint (request counts, latency histograms, tokens generated).
  • Token counting: count tokens with the `tiktoken` library before sending, to avoid OOM surprises. The counts are approximate for Llama models, which use a different tokenizer, so leave some headroom.
  • Latency profiling: add timestamp logging before and after each vLLM call to locate bottlenecks.
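
The metrics and latency bullets above can be sketched with two small helpers. This assumes the `/metrics` endpoint returns the standard Prometheus text format; the metric name used in the usage note is an assumption to check against your own server's output, and `sum_metric` and `timed` are hypothetical names.

```python
import time
from contextlib import contextmanager


def sum_metric(text: str, name: str) -> float:
    """Sum every sample of `name` in Prometheus text exposition format."""
    total = 0.0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            metric = line[: line.index("{")]
            rest = line[line.rindex("}") + 1 :]
        else:
            metric, _, rest = line.partition(" ")
        if metric == name:
            total += float(rest.split()[0])  # first field is the value
    return total


@contextmanager
def timed(label: str, sink=print):
    """Report the wall-clock duration of the wrapped block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink(f"{label}: {time.perf_counter() - start:.3f}s")
```

For example, fetch `http://localhost:8000/metrics`, pass the body to `sum_metric(body, "vllm:generation_tokens_total")`, and wrap each client call in `with timed("chat"):` to see whether time is spent in the model or elsewhere.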

Regional Context and Compliance

  • EU / GDPR (Europe): local inference satisfies GDPR Article 28 because data never leaves your infrastructure; with no external processor, no DPA is needed. Recommended for healthcare, legal, and financial workloads. German enterprise deployments can additionally pursue BSI-Grundschutz-Kataloge certification.
  • Japan / METI: the 2024 METI AI governance guidelines recommend on-premises inference for sensitive corporate data. A vLLM Tier 3 setup meets METI audit-trail requirements.
  • China / PIPL: China's Personal Information Protection Law (2021) mandates data residency. The Tier 2/3 local stack keeps all inference in-country and is compatible with Alibaba Cloud and Tencent Cloud GPU instances.
  • United States: no federal AI data-residency mandate as of 2026. HIPAA-covered entities must ensure PHI does not leave controlled infrastructure, which Tier 2/3 satisfies by default.

Common Setup Mistakes

  • Running vLLM on the same GPU as other processes (Discord, games). This causes GPU out-of-memory errors.
  • Sending requests without a timeout. If vLLM hangs, the client hangs forever too. Always set `timeout=60` on requests.
  • Assuming vLLM scales across multiple GPUs automatically. It needs an explicit `--tensor-parallel-size` flag.
  • Forgetting to set `CUDA_VISIBLE_DEVICES` on a multi-GPU machine. Every GPU is visible to vLLM by default, so separate instances can collide on the same device.
  • Deploying a Llama release commercially in 2026 without checking its current license. Llama models ship under Meta's community license rather than Apache 2.0, and terms differ between releases, so confirm on the official model card that commercial use is permitted before deploying.
  • Staying on an older Llama release when a newer instruct-tuned one fits your hardware. Newer instruct releases generally follow instructions better; check the Ollama library for the exact tag before pulling (for example, `ollama run llama3.1:8b`).
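
The two GPU-related mistakes above can be avoided by pinning each server to one device explicitly. The sketch below only builds the environment and command line for each instance. The function names `vllm_launch` and `describe`, the ports, and the model name in the usage note are assumptions for illustration; the module path and flags are the ones already used in this article.

```python
import os
import shlex


def vllm_launch(gpu: int, port: int, model: str):
    """Return (env, argv) for one vLLM server pinned to a single GPU."""
    # Restrict the process to one device so instances do not collide.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    argv = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--port", str(port),
        "--tensor-parallel-size", "1",  # one GPU per instance
    ]
    return env, argv


def describe(gpu: int, port: int, model: str) -> str:
    """Render the equivalent shell command, useful for logs or a README."""
    env, argv = vllm_launch(gpu, port, model)
    return f"CUDA_VISIBLE_DEVICES={env['CUDA_VISIBLE_DEVICES']} {shlex.join(argv)}"
```

To start both Tier 3 instances, call `subprocess.Popen(argv, env=env)` once with `vllm_launch(0, 8000, model)` and once with `vllm_launch(1, 8001, model)`, then point the load balancer at ports 8000 and 8001.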

FAQ

Which tier should I use?

Use Tier 1 if you work alone and only need chat, Tier 2 for a single developer who wants IDE integration, and Tier 3 for a team that needs a 24/7 service.

Can I use vLLM instead of Ollama?

Yes, but setup is more complex. vLLM is faster (batching) and more flexible (Python API).

How do I serve a model across multiple GPUs?

In vLLM, pass `--tensor-parallel-size 2`. This splits the model across 2 GPUs and raises throughput, up to about 2x; inter-GPU communication usually keeps the real gain below that.

Can I fine-tune on top of vLLM inference?

No. Fine-tune separately (HuggingFace Transformers), then load the fine-tuned model into vLLM.

What should I do if vLLM throws OOM errors?

Use a smaller quantization (Q4 instead of Q8), reduce the batch size, or lower the VRAM allocated per model (in vLLM, `--gpu-memory-utilization`). Check `nvidia-smi` to see what is using GPU memory.
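
A rough size estimate shows why a smaller quantization helps. The sketch below assumes weight memory is simply parameter count times bits per weight, with 16, 8, and 4 bits standing in for FP16, Q8, and Q4. It ignores the KV cache, activations, and runtime overhead, so real usage is higher. The function name `weight_gib` is hypothetical.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB (weights only, no KV cache)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Compare three precisions for an 8B-parameter model.
for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"{label}: {weight_gib(8, bits):.1f} GiB")
```

For an 8B model this gives roughly 14.9 GiB at 16 bits, 7.5 GiB at 8 bits, and 3.7 GiB at 4 bits, which is why dropping from Q8 to Q4 can be enough to fit a model that otherwise runs out of memory.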

Is Tier 3 production-ready?

Yes, once you add monitoring. Add Prometheus, Grafana, and alerting (Alertmanager). This is a standard infrastructure pattern.

A Note on Third-Party Facts

This article references third-party AI models, benchmarks, prices, and licenses. The AI landscape changes rapidly. Benchmark scores, license terms, model names, and API prices can shift between the time of writing and the time you read this. Before making deployment or compliance decisions based on this article, verify current figures on each provider's official source: Hugging Face model cards for licenses and benchmarks, provider websites for API pricing, and EUR-Lex for current GDPR and EU AI Act text. This article reflects publicly available information as of May 2026.
