
Best Local LLM Stack by Use Case 2026: Writing, Coding, RAG, Agents

10 min read · By Hans Kuepper · Founder of PromptQuorum, multi-model AI dispatch tool

์ตœ์ ์˜ ๋กœ์ปฌ LLM ์Šคํƒ์€ ์›Œํฌํ”Œ๋กœ์— ๋”ฐ๋ผ ๋‹ค๋ฆ…๋‹ˆ๋‹ค. ์ž‘๊ฐ€์—๊ฒŒ๋Š” OpenWebUI + Llama 3, ๊ฐœ๋ฐœ์ž์—๊ฒŒ๋Š” vLLM + Python SDK, ์—ฐ๊ตฌ์ž์—๊ฒŒ๋Š” LangGraph + ์ปค์Šคํ…€ ์Šคํฌ๋ฆฝํŠธ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. 2026๋…„ 4์›” ๊ธฐ์ค€, ๋ชจ๋“  ์˜์—ญ์—์„œ ์šฐ์ˆ˜ํ•œ ๋‹จ์ผ ๋„๊ตฌ๋Š” ์กด์žฌํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

์ตœ์ ์˜ ๋กœ์ปฌ LLM ์Šคํƒ์€ ์›Œํฌํ”Œ๋กœ์— ๋”ฐ๋ผ ๋‹ค๋ฆ…๋‹ˆ๋‹ค. ์ž‘๊ฐ€์—๊ฒŒ๋Š” Ollama + OpenWebUI + Llama 3.3, ๊ฐœ๋ฐœ์ž์—๊ฒŒ๋Š” vLLM + Qwen3-Coder + IDE ํ™•์žฅ, ์—ฐ๊ตฌ์ž์—๊ฒŒ๋Š” LangGraph + vLLM์ด ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค. 2026๋…„ 4์›” ๊ธฐ์ค€, ๋ชจ๋“  ์˜์—ญ์—์„œ ์šฐ์ˆ˜ํ•œ ๋‹จ์ผ ๋„๊ตฌ๋Š” ์กด์žฌํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ์ด ๊ฐ€์ด๋“œ๋Š” 7๊ฐ€์ง€ ์ผ๋ฐ˜์ ์ธ ์‚ฌ์šฉ ์‚ฌ๋ก€๋ฅผ ์ตœ์ ์˜ ์Šคํƒ(๋ฐฑ์—”๋“œ + UI + ํ†ตํ•ฉ)๊ณผ ํ•˜๋“œ์›จ์–ด ๋“ฑ๊ธ‰(VRAM 8~24 GB)์— ๋งคํ•‘ํ•ฉ๋‹ˆ๋‹ค.

Key Takeaways

  • Writing/content creation: Ollama + OpenWebUI. No extra configuration, an excellent chat UI, and an adjustable context window.
  • Coding/code review: vLLM + FastAPI + a VS Code extension. Supports batch processing, parallel inference, and streaming.
  • Local RAG: LlamaIndex + Ollama/vLLM + the Qdrant vector DB. Document chunking, embedding, and retrieval are integrated.
  • AI agents: LangGraph + a vLLM backend. Tool use, memory, and planning loops. The learning curve is steep.
  • Multi-user API: vLLM behind nginx. Handles 10 or more concurrent requests. The most scalable option.
  • Fine-tuning: HuggingFace Transformers + LoRA + Ollama (for inference). Training and serving run separately.
  • Real-time streaming: Ollama (native streaming) or vLLM + a token-streaming endpoint. Gives chatbots the best UX.

Quick Decision: Stack by Hardware Tier (April 2026)

Pick the stack that fits your GPU/VRAM. Each combination was tested against real benchmarks. Coding and agent workflows benefit more from larger models than writing does, and for RAG, embedding quality matters more than LLM size.

| Hardware | Writing | Coding | RAG | Agents |
|---|---|---|---|---|
| 4 to 8 GB VRAM (GTX 1660, RTX 3050) | Ollama + Phi-4 Mini | Ollama + Qwen3-Coder-1.5B | LlamaIndex + Phi-4 Mini | Not recommended |
| 12 GB VRAM (RTX 3060, RTX 4070) | Ollama + Llama 3.2 8B | vLLM + Qwen3-Coder-7B | LlamaIndex + Llama 3.2 8B | LangGraph + Ollama (slow) |
| 16 GB VRAM (RTX 4070 Ti, RTX 4080) | Ollama + Mistral Small 3.1 | vLLM + Qwen3-Coder-14B | LlamaIndex + Mistral 3.1 | LangGraph + vLLM |
| 24 GB VRAM (RTX 3090, RTX 4090) | Ollama + Llama 3.3 70B Q4 | vLLM + Qwen3-Coder-32B | LlamaIndex + Llama 3.3 70B | LangGraph + vLLM (fastest) |

**Best stack: Ollama + OpenWebUI + Markdown editor**

Why this stack: OpenWebUI offers the best chat UX and requires no coding. Its flexible context window (4K to 32K) beats LM Studio for long-form writing. For writers, it is also cheaper than cloud APIs.

  1. With 24 GB of VRAM: `ollama pull llama3.3:70b`. Highest quality, on par with GPT-4 (2023) on writing benchmarks.
  2. With 16 GB of VRAM: `ollama pull mistral-small3.1`. 128K context, the best quality below 24 GB.
  3. With 8 GB of VRAM: `ollama pull llama3.2:8b`. Good writing quality and fast on consumer hardware.
  4. Install OpenWebUI via Docker: `docker run -d -p 3000:8080 ghcr.io/open-webui/open-webui:latest`.
  5. Set the context window (8K to 32K tokens) in the OpenWebUI settings to match your document length.
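The context window from step 5 can also be set per request when calling Ollama directly instead of through OpenWebUI. The following is a minimal sketch, assuming Ollama listens on its default address `http://localhost:11434` and that the `num_ctx` option of `/api/generate` controls the context size. The helper names `build_payload` and `generate` are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumption: default Ollama port


def build_payload(model: str, prompt: str, num_ctx: int = 16384) -> dict:
    """Build a non-streaming /api/generate request with an explicit context window."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be positive")
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # context window in tokens
    }


def generate(model: str, prompt: str, num_ctx: int = 16384) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    body = json.dumps(build_payload(model, prompt, num_ctx)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# Example call (requires a running Ollama server with the model pulled):
# print(generate("mistral-small3.1", "Outline an essay on local LLMs.", num_ctx=32768))
```

Larger `num_ctx` values use more memory, so it is probably worth starting at the low end of the range and raising it only for long documents.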

**Best stack: vLLM + Qwen3-Coder + IDE extension**

Why this stack: Qwen3-Coder scores 82% on HumanEval (the best open-source coding model as of April 2026). vLLM is 3 to 5 times faster than Ollama for batch inference. Native OpenAI API compatibility means it works with existing IDE tools, and streaming enables real-time suggestions.

AI-Assisted Code Review Across Multiple Files

For automated code review across several files, use vLLM's batch processing:

  1. Install vLLM: `pip install vllm`.
  2. Start a vLLM server with Qwen3-Coder-7B: `python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen3-Coder-7B-Instruct --port 8000`.
  3. With 16 GB of VRAM or more, use the 14B model: `--model Qwen/Qwen3-Coder-14B-Instruct`.
  4. Point your IDE extension (Continue.dev for VS Code, Cursor, and so on) at `http://localhost:8000/v1`.
  5. Batch the code review: send one request per file concurrently (for example, 10 at a time). vLLM's continuous batching then processes the in-flight requests together on the GPU.
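```python
# Review several files concurrently; vLLM batches in-flight requests on the server
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Must match the name the server was started with (--model or --served-model-name)
MODEL = "Qwen/Qwen3-Coder-7B-Instruct"

file_names = [
    "utils.py",
    "models.py",
    # ... up to 10 files
]


def review(filename: str) -> tuple[str, str]:
    """Request a review of one file and return (filename, review text)."""
    code = Path(filename).read_text(encoding="utf-8")
    prompt = f"Review this code for bugs, style, and performance:\n\n{code}"
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # a low temperature keeps reviews close to deterministic
    )
    return filename, response.choices[0].message.content


# Each chat request carries one conversation, so parallelism comes from sending
# the requests concurrently rather than from a single batched call
with ThreadPoolExecutor(max_workers=10) as pool:
    reviews = list(pool.map(review, file_names))

for filename, text in reviews:
    print(f"=== {filename} ===\n{text}\n")
```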

**Best stack: LlamaIndex + Ollama/vLLM + Qdrant + FastAPI UI**

Why this stack: LlamaIndex handles chunking and retrieval. Qdrant is fast, runs locally, and keeps your data private. Ollama provides embeddings at no cost, and vLLM can take over LLM inference.

  1. Install LlamaIndex (`pip install llama-index`).
  2. Load your documents (PDF, TXT, Markdown) into LlamaIndex.
  3. Chunk the documents (1024 tokens by default) and embed them with a local model, or with OpenAI as a fallback.
  4. Store the embeddings in a Qdrant vector DB running locally via Docker.
  5. Query through LlamaIndex: retrieve the top K most similar documents and pass them to the LLM as context alongside the prompt.
  6. Wrap the pipeline in a FastAPI endpoint for a web UI or IDE integration.
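To make steps 3 to 5 concrete, here is a library-free sketch of the chunk, embed, retrieve, and prompt flow. It is not LlamaIndex or Qdrant code: a toy word-count vector stands in for a real embedding model, a plain Python list stands in for the vector DB, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Combine the retrieved chunks and the question into one LLM prompt."""
    joined = "\n---\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


chunks = [
    "Qdrant stores vectors locally",
    "Ollama serves models",
    "nginx balances load",
]
question = "where are vectors stored"
prompt = build_prompt(question, top_k(question, chunks, k=1))
```

In a real pipeline, the embedding model and the vector store replace `embed` and the list, but the order of operations stays the same.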

**Best stack: LangGraph + vLLM + tool definitions**

Why this stack: LangGraph gives agents a structured flow. vLLM is fast enough even for 10 or more sequential LLM calls. Tool use is explicit and debuggable.

  1. Install LangGraph (`pip install langchain langgraph`).
  2. Define your tools (web search, calculator, file I/O) as function signatures.
  3. Build an agent graph with the LLM as the decision node and the tools as action nodes.
  4. Use a vLLM backend for low-latency LLM calls in tight loops.
  5. Run the agent loop: LLM → tool selection → tool execution → repeat until done.
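The loop in step 5 can be sketched without any framework. The code below is an illustration of the control flow only, not the LangGraph API: `stub_llm` is a hypothetical stand-in for the decision node (a real agent would call the LLM there), and the calculator is a deliberately tiny tool.

```python
from typing import Callable

Tool = Callable[[str], str]


def calculator(expression: str) -> str:
    """Toy tool: add two integers written as 'a+b' (no eval, for safety)."""
    left, right = expression.split("+")
    return str(int(left) + int(right))


TOOLS: dict[str, Tool] = {"calculator": calculator}


def stub_llm(history: list[str]) -> dict:
    """Hypothetical decision node: pick a tool first, then finish with its result."""
    tool_results = [h for h in history if h.startswith("tool:")]
    if not tool_results:
        return {"action": "calculator", "input": "2+3"}
    return {"action": "finish", "input": tool_results[-1][len("tool:"):]}


def run_agent(task: str, decide=stub_llm, max_steps: int = 10) -> str:
    """LLM -> tool selection -> tool execution, repeated until the LLM finishes."""
    history = [f"user:{task}"]
    for _ in range(max_steps):
        decision = decide(history)
        if decision["action"] == "finish":
            return decision["input"]
        tool = TOOLS[decision["action"]]            # action node lookup
        history.append(f"tool:{tool(decision['input'])}")
    raise RuntimeError("agent did not finish within max_steps")


answer = run_agent("What is 2+3?")
```

The `max_steps` cap is a design choice worth keeping in any real agent: every iteration is a sequential LLM call, so an unbounded loop multiplies latency.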

**Best stack: vLLM + nginx load balancer + monitoring**

Why this stack: vLLM supports distributed serving, and nginx multiplexes requests across instances. On a dual-GPU system, the setup scales to 10 or more concurrent users. You can monitor token throughput per user.

  1. Deploy vLLM on a fixed port with `--served-model-name model-name`.
  2. Configure nginx load balancing across 2 or more vLLM instances (one per GPU on multi-GPU systems).
  3. Use the OpenAI-compatible `/v1/chat/completions` endpoint for client compatibility.
  4. Monitor through the Prometheus scrape endpoint (vLLM exports request latency and throughput metrics).
  5. Set rate limits with a per-user token bucket algorithm.

**Best stack: HuggingFace Transformers + LoRA + Ollama (inference)**

Why this stack: LoRA cuts fine-tuning VRAM to about one tenth. Ollama loads fine-tuned models easily. The setup is modular: train on one box and serve on another.

Note (April 2026): Meta has deprecated Llama 3.3 for commercial fine-tuning. For Apache 2.0 or similarly open license terms, fine-tune from Llama 3.2 (`meta-llama/Llama-3.2-1B` or larger) or Qwen3 (`Qwen/Qwen3-7B`). Both support LoRA and load easily in Ollama. Check each model card before relying on a license: Qwen3 is released under Apache 2.0, while Llama models ship under Meta's own community license.

  1. Fine-tune with the `peft` library (LoRA) to reduce VRAM usage.
  2. Training needs about 4 times the model's VRAM (optimizer state, gradients). Run it separately from inference.
  3. Export the LoRA adapter to the HuggingFace Hub or to the local file system.
  4. Load the fine-tuned model in Ollama: `ollama create mymodel -f Modelfile`.
  5. Alternatively, use HuggingFace TRL (Transformer Reinforcement Learning) for RLHF.
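Step 4 assumes a `Modelfile` that ties the exported adapter to its base model. A minimal sketch of generating one is shown below, assuming Ollama's `FROM`, `ADAPTER`, and `PARAMETER` Modelfile instructions. The base model tag and adapter path are placeholders, and the helper names are hypothetical.

```python
from pathlib import Path


def render_modelfile(base: str, adapter_path: str, temperature: float = 0.7) -> str:
    """Return Modelfile text that applies a LoRA adapter on top of a base model."""
    lines = [
        f"FROM {base}",                 # must be the model the adapter was trained from
        f"ADAPTER {adapter_path}",      # path to the exported LoRA adapter
        f"PARAMETER temperature {temperature}",
    ]
    return "\n".join(lines) + "\n"


def write_modelfile(directory: str, base: str, adapter_path: str) -> Path:
    """Write the Modelfile into `directory` and return its path."""
    path = Path(directory) / "Modelfile"
    path.write_text(render_modelfile(base, adapter_path), encoding="utf-8")
    return path


# Placeholder values for illustration only
text = render_modelfile("llama3.2", "./lora-adapter")
```

After writing the file, `ollama create mymodel -f Modelfile` registers the combined model. Which adapter formats are accepted depends on the Ollama version, so check its Modelfile documentation first.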

**Best stack: Ollama (native streaming) or vLLM + Server-Sent Events (SSE)**

Why this stack: streaming improves perceived performance because users see tokens appear as they are generated. Ollama is the simplest option. vLLM has the highest token throughput.

  1. Ollama: call `/api/generate` with `stream: true`. Tokens arrive as newline-delimited JSON.
  2. vLLM: use `/v1/chat/completions` with `stream: true`. It returns an OpenAI-compatible SSE stream.
  3. Frontend: consume the stream in JavaScript and update the UI per token. The EventSource API only issues GET requests, so for these POST endpoints read the response with `fetch` and a stream reader, or place a GET proxy in front.
  4. For the lowest latency, disable batching (batch=1).
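The two wire formats in steps 1 and 2 differ only in framing. The sketch below parses both from an iterable of text lines, assuming Ollama's `response`/`done` fields and the OpenAI-style `choices[0].delta.content` chunks terminated by `data: [DONE]`. The function names are hypothetical.

```python
import json
from typing import Iterable, Iterator


def ollama_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield text pieces from a newline-delimited JSON stream (Ollama style)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("response"):
            yield event["response"]
        if event.get("done"):
            break


def sse_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield text pieces from an OpenAI-compatible SSE chat stream (vLLM style)."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        choices = json.loads(data).get("choices") or []
        if not choices:
            continue  # some chunks carry no choices
        content = choices[0].get("delta", {}).get("content")
        if content:
            yield content


# Sample lines shaped like each server's output (hand-written for illustration)
ndjson_lines = [
    '{"response": "Hel", "done": false}',
    '{"response": "lo", "done": false}',
    '{"response": "", "done": true}',
]
sse_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hi"}}]}',
    "data: [DONE]",
]
```

With a real HTTP client, the same generators can be fed from the response body line by line, so the UI code stays independent of the backend.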

Should I use Ollama or vLLM?

Ollama suits chat UIs and simplicity. vLLM suits API servers, batch processing, and performance. They are not mutually exclusive, so you can run both.

Can I use Ollama for a production API?

Yes, but vLLM is faster (3 to 5 times the throughput). Ollama is fine below 10 requests per second. At 10 requests per second or more, use vLLM.

What is the best local LLM for code review?

vLLM + Qwen3-Coder-7B-Instruct. Qwen3-Coder scores 82% on HumanEval (the best among open-source models), and vLLM processes 10 files in parallel. Expect roughly 30 to 50 tok/sec on an RTX 3060 12GB.

Do I need a vector DB for simple RAG?

With fewer than 100 documents, in-memory embeddings (`np.ndarray`) are enough. With 100 or more, use Qdrant or Weaviate to avoid running out of memory.

Is LangGraph overkill for a simple chatbot?

Yes. Use Ollama or vLLM on its own. LangGraph is meant for multi-step workflows (agent loops, planning).

Can I mix Ollama and vLLM backends?

Yes. For example, use Ollama for the chat UI and vLLM for the batch API. Both can run on the same machine on different ports.


Common Mistakes When Choosing an LLM Stack

  • Using Ollama for a production API without vLLM: Ollama tops out below 10 requests per second. A production environment serving 10 or more concurrent users needs vLLM. Confirm throughput with a load test before deploying.
  • Running LangGraph without a vLLM backend: LangGraph agents make 10 or more sequential LLM calls, and Ollama becomes the latency bottleneck. Pair LangGraph with vLLM to keep round trips under one second.
  • Mixing Ollama + vLLM on the same GPU without memory management: both tools load weights into VRAM. Running two instances of a 70B model consumes 32 GB. Use separate GPUs, or quantize heavily (Q2) so both models fit.
  • Choosing the wrong context window for writing: the default 4K context limits brainstorming sessions. For long-form writing, set a 16K to 32K context window in OpenWebUI. The trade-off is slower inference (2 to 3 times slower per token).
  • Assuming every backend runs at the same speed: vLLM and Ollama use different kernels. On identical hardware, vLLM is 2 to 3 times faster at inference. The speed difference lies in the backend, not in the frontend (OpenWebUI and LM Studio are only UIs).

Sources

  • Ollama GitHub: official documentation, streaming API specification, model library.
  • vLLM GitHub: documentation on OpenAI API compatibility, batch processing, and continuous batching.
  • Qwen3-Coder technical report: Alibaba Qwen. HumanEval score of 82%, specialized for coding tasks. Apache 2.0 license.
  • LlamaIndex documentation: framework for document indexing, chunking, and RAG retrieval.
  • LangGraph documentation: agent workflow framework, state machines, tool-use patterns.
  • Qdrant documentation: vector database for local embedding storage, Docker support, Apache 2.0.
  • Continue.dev documentation: IDE extension for VS Code and JetBrains that uses local LLM backends.

A Note on Third-Party Facts

This article references third-party AI models, benchmarks, prices, and licenses. The AI landscape changes rapidly. Benchmark scores, license terms, model names, and API prices can shift between the time of writing and the time you read this. Before making deployment or compliance decisions based on this article, verify current figures on each provider's official source: Hugging Face model cards for licenses and benchmarks, provider websites for API pricing, and EUR-Lex for current GDPR and EU AI Act text. This article reflects publicly available information as of May 2026.
