How to Double Local LLM Speed: Optimization Techniques

10 min read · By Hans Kuepper · Founder of PromptQuorum, multi-model AI dispatch tool · PromptQuorum

With the right optimizations, local LLMs can run 2-3x faster. The main techniques are disabling logging, tuning batch size, choosing the right quantization, switching to a faster inference engine, and tuning GPU memory.

As of April 2026, combining these techniques yields roughly a 2x speedup with no loss in output quality.

Key Takeaways

  • Disable logging/debugging (easy): about 10% faster.
  • Use Q4 quantization (easy): same speed, less VRAM.
  • Optimize batch size (medium): 2-3x faster for batch workloads.
  • Use vLLM instead of Ollama (hard): 2-5x faster under concurrent requests.
  • Raise GPU memory utilization to 90% or more (medium): 15-20% faster.
  • All techniques combined: about 2-3x faster overall.

How GPU Memory Utilization Affects Speed

By default, most tools use only 70-80% of GPU VRAM and leave the rest idle. Raising that to 90-95% lets the engine preallocate more KV cache, which can yield a 15-20% speedup:

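bash
# vLLM: raise the fraction of VRAM the engine may claim (vLLM's default is 0.90)
vllm serve meta-llama/Llama-2-7b-hf \
  --gpu-memory-utilization 0.95

# Ollama: environment variable.
# Note: OLLAMA_GPU_THRESHOLD does not appear in Ollama's documented environment
# variables; confirm it has an effect on your version before relying on it.
export OLLAMA_GPU_THRESHOLD=0.95  # intended: use 95% of GPU memory
ollama run llama3.2:3b

# LM Studio: Settings → GPU acceleration slider (move to 100%)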
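
To see why a higher utilization setting leaves more room for KV cache, the arithmetic can be sketched in Python. This is a rough estimate, not how any engine actually accounts for memory: the layer count, head count, and head size are assumed Llama-2-7B-style shapes, `weights_gib=13.0` is an assumed round figure for FP16 weights, and activation and framework overhead are ignored.

```python
# Rough KV-cache budget for a Llama-2-7B-style model (assumed shapes, FP16).
GIB = 1024 ** 3


def kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128, dtype_bytes=2):
    # 2x for keys and values, stored per layer and per KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_token_capacity(vram_gib, utilization, weights_gib):
    """Tokens of KV cache that fit once the weights are resident."""
    budget_gib = vram_gib * utilization - weights_gib
    if budget_gib <= 0:
        return 0
    return int(budget_gib * GIB // kv_bytes_per_token())


# Assumed 24 GiB card; compare three utilization settings.
for util in (0.80, 0.90, 0.95):
    tokens = kv_token_capacity(vram_gib=24, utilization=util, weights_gib=13.0)
    print(f"utilization {util:.2f}: ~{tokens:,} cached tokens")
```

More cached tokens means more or longer sequences can stay resident at once, which is where the throughput gain would come from.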

Batch Size for Maximum Throughput

For batch workloads (multiple prompts), raising the batch size from 1 to 32 improves throughput 2-4x.

A single request makes limited use of the GPU pipeline; a batch of 32 requests delivers 2-4x the throughput.

Trade-off: latency per individual request goes up, because each request waits for the batch to finish.

| Batch Size | Throughput | Latency/Request | Use Case |
|---|---|---|---|
| 1 (single) | 50 tokens/sec | Minimal | Real-time chat |
| 8 | 120 tokens/sec | Acceptable | Light concurrency |
| 32 | 200 tokens/sec | High | Batch API |
| 64+ | 250+ tokens/sec | Very high | Offline batch |
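
The trade-off in the table can be made concrete with a little arithmetic. The sketch below treats the throughput column as aggregate tokens/sec shared equally across the batch (an assumption), uses 250 as a lower bound for the 64+ row, and picks a hypothetical 200-token reply to show how long one request would take.

```python
# Aggregate throughput figures from the table above (tokens/sec).
TABLE = {1: 50, 8: 120, 32: 200, 64: 250}


def batch_tradeoff(table, reply_tokens=200):
    """Speedup vs. batch size 1, and each request's share of the throughput."""
    base = table[1]
    rows = []
    for batch, total_tps in sorted(table.items()):
        per_request_tps = total_tps / batch
        rows.append({
            "batch": batch,
            "speedup": total_tps / base,
            "per_request_tps": per_request_tps,
            "seconds_per_reply": reply_tokens / per_request_tps,
        })
    return rows


for row in batch_tradeoff(TABLE):
    print(row)
```

Under these assumptions, a batch of 32 quadruples aggregate throughput while each request's share drops from 50 to 6.25 tokens/sec.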

Fastest Inference Engine: vLLM vs Ollama vs llama.cpp

vLLM: 5-10x faster than Ollama under concurrent requests. Suited to production APIs serving many users.

llama.cpp: fastest for single requests on consumer hardware. Suited to personal local setups.

Ollama: best developer experience for single-user setups; performance on single requests is close to llama.cpp.

Text-Generation-WebUI: slowest but most feature-rich. Suited to experimentation only, not production.

Does Quantization Actually Speed Up Inference?

On recent GPUs (RTX 40 series), Q4 and Q5 run at the same speed as FP16. Quantize to save VRAM, not to gain speed.

Indirect speed benefits of quantization:

  • Smaller model file = faster cold-start loading from disk
  • Less memory bandwidth needed = about 10-15% faster on older or memory-constrained hardware

Quantization is mainly a way to save VRAM, not a way to raise raw token throughput.
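
The VRAM saving is easy to estimate. The sketch below uses nominal bit widths (16, 5, 4) as an assumption; real GGUF variants such as Q4_K_M store extra scale data and come out somewhat larger, so treat these numbers as lower bounds.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes) at a nominal bit width."""
    return params_billion * bits_per_weight / 8


# A 7B model at three nominal precisions.
for name, bits in (("FP16", 16), ("Q5", 5), ("Q4", 4)):
    print(f"{name}: ~{weight_gb(7, bits):.1f} GB for a 7B model")
```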

Realistic Speed Gains to Expect

Example: optimizing a 7B model on an RTX 4090, step by step:

| Change | Speed | Cumulative Gain |
|---|---|---|
| Default Ollama (baseline) | 120 tok/sec | - |
| Disable debug logging | 132 tok/sec | +10% |
| GPU memory → 95% | 150 tok/sec | +25% total |
| Switch to vLLM (batched) | 300 tok/sec (batched) | 2.5x (batched) |
| All optimizations applied | 300 tok/sec | 2.5x throughput |
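
The cumulative column is each step's speed divided by the baseline. A minimal check of that arithmetic, using only the figures from the table:

```python
# Step speeds from the table above (tok/sec), in the order applied.
STEPS = [
    ("Default Ollama (baseline)", 120),
    ("Disable debug logging", 132),
    ("GPU memory -> 95%", 150),
    ("Switch to vLLM (batched)", 300),
]


def cumulative_gains(steps):
    """Return (name, speed, ratio to the first step) for every step."""
    baseline = steps[0][1]
    return [(name, speed, speed / baseline) for name, speed in steps]


for name, speed, ratio in cumulative_gains(STEPS):
    print(f"{name:28s} {speed:4d} tok/sec  {ratio:.2f}x baseline")
```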

Common Speed Optimization Mistakes

  • Setting GPU memory to 100%. This risks out-of-memory crashes. The safe maximum is 90-95%.
  • Lowering batch size for speed. Batch size is a throughput setting: when only one request is in flight, lowering it does not make that request faster.
  • Over-quantizing for speed. Q4 runs at nearly the same speed as FP16. Quantize for VRAM, not speed.
  • Switching inference engines mid-deployment. Moving from Ollama to vLLM to llama.cpp introduces bugs. Pick one and optimize it.

Frequently Asked Questions

What is the single most effective way to speed up local LLM inference?

For concurrent workloads, switching from Ollama to vLLM gives the largest single gain: 5-10x higher throughput when requests are batched. For single requests, raising GPU memory utilization from 70% to 90-95% can give a 15-20% speedup. Disabling debug logging adds about another 10%.

Does batching improve single-request latency?

No. Batch size affects throughput (total tokens per second across all requests), not the latency of a single request. To cut latency for one request, tune GPU memory utilization and use a faster engine (vLLM or llama.cpp). Larger batches add waiting time per request.

How much faster is vLLM than Ollama?

For single requests, vLLM and Ollama perform about the same (roughly 120-150 tok/sec for a 7B model on an RTX 4090). For concurrent requests, vLLM is 5-10x faster thanks to continuous batching and PagedAttention. Use Ollama for personal or single-user setups, and switch to vLLM for APIs that serve many users.

Does quantization speed up inference?

The main benefit of quantization is lower VRAM use, not speed. On recent NVIDIA GPUs (RTX 40 series), Q4 and Q5 run at the same speed as FP16. The indirect speed benefits: a smaller Q4 model loads from disk faster and can allow a slightly larger batch size within the same VRAM.

How should I set GPU memory utilization for maximum speed?

In vLLM, set GPU memory utilization to 90-95% (`--gpu-memory-utilization 0.92`). The engine then preallocates more memory for the KV cache, which improves throughput. Avoid 100%: an OOM crash follows when generation needs more memory than estimated. A 5-10% safety margin is essential.

Why is my local LLM slow on the first prompt?

The first prompt loads the model into VRAM (a cold start), which can take 10-30 seconds. Later prompts run at full speed. Do not restart the server between sessions. For Ollama, set OLLAMA_KEEP_ALIVE=24h to stop the model from being unloaded after a period of inactivity.
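
The same keep-alive behaviour can also be requested per call through Ollama's HTTP API. The sketch below is a minimal illustration: the helper names are hypothetical, the URL assumes Ollama's default local port, and `warm_up` needs a running Ollama server. An empty prompt is used on the assumption that it only loads the model.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_warmup_request(model, keep_alive="24h", url=OLLAMA_URL):
    """Build a POST that loads `model` and asks the server to keep it resident."""
    payload = {"model": model, "prompt": "", "keep_alive": keep_alive, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(model):
    """Send the warm-up request; requires a running Ollama server."""
    with urllib.request.urlopen(build_warmup_request(model), timeout=120) as resp:
        return json.load(resp)
```

Calling `warm_up("llama3.2:3b")` once at startup would pay the cold-start cost before the first real prompt arrives.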

Can CPU-only inference be meaningfully accelerated?

Limited gains are possible. In llama.cpp, use the -t flag to set the thread count to the number of physical cores (not logical cores), enable the AVX2/AVX-512 instruction sets, and use Q4_K_M quantization. The realistic ceiling on a recent i9 is 8-12 tok/sec. For acceptable latency in interactive chat, GPU hardware is the only option.
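
Finding the physical core count takes a little care, because Python's `os.cpu_count()` reports logical cores. A minimal sketch for Linux follows; it assumes the x86 `/proc/cpuinfo` layout with `physical id` and `core id` fields, and the function names are hypothetical.

```python
import os


def physical_cores(cpuinfo_text):
    """Count unique (physical id, core id) pairs in /proc/cpuinfo-style text."""
    cores = set()
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "physical id":
            physical_id = value
        elif key == "core id":
            cores.add((physical_id, value))
    return len(cores)


def recommended_threads(cpuinfo_path="/proc/cpuinfo"):
    """Physical core count, falling back to the logical count if unavailable."""
    try:
        with open(cpuinfo_path) as f:
            count = physical_cores(f.read())
    except OSError:
        count = 0
    return count or os.cpu_count() or 1


print(f"suggested value for llama.cpp -t: {recommended_threads()}")
```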

How does context length affect inference speed?

Longer context windows slow inference because the attention mechanism scales quadratically with context length. A 4K-token prompt takes roughly 4x longer to process than a 1K prompt. Keep system prompts under 500 tokens, and summarize context in long conversations to keep speed up.
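
The two scaling terms can be separated with a short calculation. This is a simplified cost model, not a benchmark: it assumes projection and MLP work grows linearly with prompt length while the attention score computation grows with its square.

```python
def prefill_cost_ratio(short_ctx, long_ctx):
    """Return (linear_ratio, attention_ratio) between two prompt lengths."""
    scale = long_ctx / short_ctx
    linear = scale          # projections and MLP layers: O(n)
    attention = scale ** 2  # attention score matrix: O(n^2)
    return linear, attention


linear, attention = prefill_cost_ratio(1024, 4096)
print(f"linear terms: {linear:.0f}x, attention scores: {attention:.0f}x")
```

Going from 1K to 4K tokens, the linear terms grow 4x and the attention score term grows 16x. The roughly 4x figure above corresponds to the case where the linear terms still dominate; the quadratic term takes over as prompts get longer.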

What is PagedAttention, and why does it make vLLM faster?

PagedAttention is vLLM's KV cache management system. Instead of preallocating a fixed memory block per request, it pages memory dynamically, much like an operating system's virtual memory. This removes VRAM fragmentation, admits more concurrent requests, and raises GPU utilization from about 55% (default) to 90% or more.
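
A toy model shows why paging admits more requests. This is not vLLM's allocator: the pool size, the 2048-token reservation, the 300-token requests, and the 16-token block size are all assumptions chosen for illustration.

```python
import math


def fits_preallocated(pool_tokens, max_len):
    """Each request reserves max_len tokens up front, whatever it really uses."""
    return pool_tokens // max_len


def fits_paged(pool_tokens, actual_lens, block_size=16):
    """Each request takes whole blocks only for the tokens it actually holds."""
    free_blocks = pool_tokens // block_size
    admitted = 0
    for n in actual_lens:
        need = math.ceil(n / block_size)
        if need > free_blocks:
            break
        free_blocks -= need
        admitted += 1
    return admitted


POOL = 16384  # tokens of KV cache available (assumed)
print("preallocated:", fits_preallocated(POOL, max_len=2048))
print("paged:", fits_paged(POOL, [300] * 100))
```

With these assumed numbers, fixed preallocation admits 8 requests while paging admits 53, because short requests no longer hold memory they never use.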

Is there a speed difference between the GGUF and safetensors model formats?

Yes. GGUF (used by llama.cpp and Ollama) is optimized for CPU and consumer-GPU inference with built-in quantization. Safetensors (used by vLLM and HuggingFace) is faster for full-precision GPU inference. On an RTX 40 series GPU running FP16, safetensors + vLLM is typically 10-20% faster than GGUF + Ollama.

A Note on Third-Party Facts

This article references third-party AI models, benchmarks, prices, and licenses. The AI landscape changes rapidly. Benchmark scores, license terms, model names, and API prices can shift between the time of writing and the time you read this. Before making deployment or compliance decisions based on this article, verify current figures on each provider's official source: Hugging Face model cards for licenses and benchmarks, provider websites for API pricing, and EUR-Lex for current GDPR and EU AI Act text. This article reflects publicly available information as of May 2026.
