
Xinference: A 2026 Guide to Running Llama 3, Qwen, ChatGLM & Mistral Locally

10 min read · By Hans Kuepper · Founder of PromptQuorum, multi-model AI dispatch tool

Install Xinference with `pip install "xinference[all]"`, start the server with `xinference-local`, then run `xinference launch --model-name llama-3.1-instruct --model-engine transformers --size-in-billions 8`. Xinference supports Llama 3, Qwen, ChatGLM4, Mistral, and more than 30 other model families out of the box, and serves every model through an OpenAI-compatible API at localhost:9997.

Xinference (Xorbits Inference) is an open-source framework that installs with a single pip command and launches any of 30+ model families, including Llama 3, Qwen, ChatGLM4, and Mistral, with one CLI call. Every model is served through an OpenAI-compatible API. Unlike Ollama, which targets single-user convenience, Xinference is designed for teams that need multi-model serving, GPU cluster support, and embedding and reranking alongside LLM inference. This guide covers the supported model families, installation, per-model launch commands, and how Xinference compares with Ollama and vLLM.

Key Takeaways

  • Xinference serves 30+ model families through one API: Llama 3, Qwen, ChatGLM4, Mistral, embedding models, and rerankers all share the localhost:9997/v1 endpoint.
  • One pip install, one CLI command: after `pip install "xinference[all]"`, `xinference-local` starts a server with a built-in web UI, and `xinference launch --model-name <name>` deploys any registered model.
  • Three selectable backends: `transformers` (GPU, full precision), `llama.cpp` (CPU plus quantized GGUF, no GPU required), and `vllm` (high-throughput multi-GPU). The backend can be chosen per model.
  • For CJK workloads, Qwen 2.5 and ChatGLM4 are the strongest options in Xinference: both run in roughly 6 to 7 GB of VRAM and outperform English-centric models on Chinese and Japanese benchmarks.
  • Choose Xinference when you need multi-model serving, embeddings and reranking, or GPU cluster support. Ollama remains the better fit for single-user desktop convenience.

What is Xinference and how does it work?

Xinference (github.com/xorbitsai/inference) is an open-source serving framework for LLMs and multimodal models, developed by Xorbits. It began as an enterprise inference platform for distributed clusters and was open-sourced in 2023. The core idea is simple: you refer to a model by name, and Xinference downloads the weights, selects a suitable backend, and exposes a REST API. You never have to write model-loading code yourself.

Xinference is an open-source inference server that supports Llama 3, Qwen, ChatGLM4, Mistral, and 30+ other model families out of the box through a single OpenAI-compatible API.

Think of Xinference as a switchboard for local AI models. You name the model to load, Xinference downloads and starts it, and your application talks to it the same way it talks to the OpenAI API. No application code changes are required beyond the base URL and model name.

  • Model registry: more than 200 pre-registered models. You reference them by name (`llama-3.1-instruct`, `qwen2.5-instruct`, `chatglm4`) instead of managing weight paths yourself.
  • Backend abstraction: a single launch option switches between the transformers, llama.cpp, and vLLM backends, and the API stays the same regardless of backend.
  • Concurrent multi-model serving: Llama 3 for text generation and a BGE embedding model for RAG can run on the same GPU at the same time.
  • Web UI: a React dashboard at localhost:9997 lets you launch, inspect, and terminate models without writing code.
  • Cluster mode: a supervisor + worker architecture scales across multiple GPU nodes. You start `xinference-supervisor` on the head node and `xinference-worker` on each GPU node, pointed at the supervisor's endpoint.
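
To make the "same wire format as OpenAI" point concrete, here is a minimal sketch that builds a raw chat request using only the Python standard library. It assumes the server is on the default port and follows the usual OpenAI route `/v1/chat/completions`. The helper names `build_chat_request` and `send` are hypothetical, and the model UID is assumed to equal the model name.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9997/v1"  # default local endpoint named in this guide


def build_chat_request(model_uid, user_text, base_url=BASE_URL):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model_uid,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON body (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request is side-effect free; nothing is sent here.
req = build_chat_request("llama-3.1-instruct", "Hello!")
print(req.get_method(), req.full_url)
```

With a model running, `send(req)["choices"][0]["message"]["content"]` would hold the reply, assuming the response follows the OpenAI schema.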

Supported model families: Llama 3, Qwen, ChatGLM, Mistral

The table below lists the seven most requested model configurations on Xinference and the approximate minimum VRAM each needs. All seven share the same launch command pattern: only `--model-name`, `--size-in-billions`, and optionally `--quantization` change.

Xinference supports Llama 3.1 (8B/70B), Qwen 2.5 (7B/72B), ChatGLM4 9B, Mistral 7B Instruct v0.3, and Mixtral 8x7B out of the box, and each launches with a single CLI command.

VRAM is the memory on your GPU. A model that needs 6 GB of VRAM requires a GPU with at least that much, for example an RTX 3060 12 GB or an RTX 4060 8 GB. If your GPU is smaller, the llama.cpp backend with Q4 quantization can cut memory use roughly in half.
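
The arithmetic behind these figures can be sketched as parameter count times bits per weight. The helper name `weight_memory_gb` is hypothetical, and the figure of about 4.5 bits per weight for a Q4-style quantization is an assumption, since real GGUF variants differ. The result covers weights only. The KV cache, activations, and runtime overhead come on top, which is presumably why an 8B model is listed at around 6 GB rather than 4.5 GB.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# 4.5 bits/weight is an assumed average for Q4-style quantization.
for bits, label in [(16, "FP16"), (8, "INT8"), (4.5, "~Q4")]:
    print(f"8B model at {label}: ~{weight_memory_gb(8, bits):.1f} GB of weights")
```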

| Model | Developer | VRAM (Q4) | Recommended backend | Best for |
| llama-3.1-instruct 8B | Meta | ~6 GB | transformers / llama.cpp | General-purpose English |
| llama-3.1-instruct 70B | Meta | ~40 GB | vLLM | High-quality English output |
| qwen2.5-instruct 7B | Alibaba | ~6 GB | transformers / llama.cpp | Multilingual, CJK, coding |
| qwen2.5-instruct 72B | Alibaba | ~40 GB | vLLM | Large-scale CJK workloads |
| chatglm4 9B | Zhipu AI | ~7 GB | transformers | Chinese enterprise workloads |
| mistral-instruct-v0.3 7B | Mistral AI | ~5 GB | transformers / llama.cpp | European languages, function calling |
| mixtral-instruct-v0.1 8x7B | Mistral AI | ~26 GB | vLLM | High-quality multilingual output |
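
As a quick way to read the table against your own hardware, the sketch below copies its approximate VRAM figures into a list and filters by a GPU budget. The `headroom_gb` margin is a hypothetical safety buffer, not something Xinference defines.

```python
# Approximate Q4 VRAM figures copied from the table above (GB).
MODELS = [
    ("llama-3.1-instruct", "8B", 6),
    ("llama-3.1-instruct", "70B", 40),
    ("qwen2.5-instruct", "7B", 6),
    ("qwen2.5-instruct", "72B", 40),
    ("chatglm4", "9B", 7),
    ("mistral-instruct-v0.3", "7B", 5),
    ("mixtral-instruct-v0.1", "8x7B", 26),
]


def models_that_fit(vram_gb, headroom_gb=1.0):
    """Return the table entries whose VRAM need plus headroom fits the budget."""
    return [
        f"{name} {size}"
        for name, size, need_gb in MODELS
        if need_gb + headroom_gb <= vram_gb
    ]


print(models_that_fit(8))   # candidates for an 8 GB card
print(models_that_fit(48))  # candidates for a 48 GB card
```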

Does Xinference support Llama 3.1?

Yes. For the 8B variant, use `--model-name llama-3.1-instruct` with `--size-in-billions 8`; for 70B, pass `70` instead. Both use the transformers backend by default, and on CPU-only or low-VRAM machines you can switch to `--model-engine llama.cpp` with `--quantization q4_k_m`.

Does Xinference support Qwen 2.5?

Yes. Qwen 2.5 Instruct is registered as `qwen2.5-instruct` and is available in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B sizes. The 7B variant runs in about 6 GB of VRAM and handles Chinese, Japanese, Korean, and English at a quality comparable to Llama 3.1 8B.

Does Xinference support ChatGLM?

Yes. ChatGLM3 (`chatglm3`), ChatGLM4 (`chatglm4`), and the vision variant ChatGLM4-Vision (`chatglm4v`) are all registered. ChatGLM4 9B is the recommended choice for Chinese-language workloads in 2026.

Does Xinference support Mistral?

Yes. Both `mistral-instruct-v0.3` (7B) and `mixtral-instruct-v0.1` (8x7B MoE) are registered. For function calling and JSON output, Mistral 7B Instruct v0.3 is the best small-model option in Xinference.

Installing Xinference: pip install and server start

Xinference requires Python 3.9 or later and pip. The `[all]` extra installs CUDA support, the llama.cpp backend, and the transformers backend in one step. On a CPU-only machine, use `pip install xinference` (without `[all]`) and add `--model-engine llama.cpp` when launching models.

Install Xinference with `pip install "xinference[all]"`, start the server with `xinference-local`, and the web UI becomes available at http://localhost:9997.

```bash
# Full install: CUDA + transformers + llama.cpp backends
pip install "xinference[all]"

# CPU-only install (no GPU required)
pip install xinference

# Start the local server (web UI: http://localhost:9997)
xinference-local

# Bind to all interfaces so the server is reachable from the LAN
xinference-local --host 0.0.0.0 --port 9997
```
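
After starting the server, a script can confirm it is reachable before launching anything. The sketch below probes the OpenAI-style model listing route `/v1/models`. That Xinference answers this route with HTTP 200 is an assumption here, and `models_url` and `server_is_up` are hypothetical helper names.

```python
import urllib.error
import urllib.request


def models_url(host="localhost", port=9997):
    """URL of the OpenAI-style model listing route (assumed to exist)."""
    return f"http://{host}:{port}/v1/models"


def server_is_up(url, timeout=2.0):
    """Return True if the URL answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False


print(models_url())
```

Calling `server_is_up(models_url())` in a short retry loop is one way to wait for `xinference-local` to finish starting.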

Do I need a GPU to use Xinference?

No. With the llama.cpp backend (`--model-engine llama.cpp`), quantized GGUF models run on the CPU alone. Inference is slower than on a GPU, but it works on any machine with Python 3.9 or later installed.

How do I update Xinference?

Run `pip install --upgrade xinference`. If you use cluster mode, check the GitHub releases page for compatibility changes before upgrading.

How to launch Llama 3, Qwen, ChatGLM, and Mistral

Use `xinference launch` to deploy any registered model. The pattern is always the same: `--model-name` sets the model family, `--size-in-billions` sets the parameter count, and `--model-engine` selects the backend. After launch, Xinference returns a model UID that you use in API calls.

Launch any Xinference model with `xinference launch --model-name <name> --model-engine transformers --size-in-billions <size>`, and it becomes available at localhost:9997/v1 within seconds of the download finishing.

```bash
# Llama 3.1 8B Instruct (GPU, transformers backend)
xinference launch \
  --model-name llama-3.1-instruct \
  --model-engine transformers \
  --size-in-billions 8

# Llama 3.1 8B Instruct (CPU, Q4_K_M quantization)
xinference launch \
  --model-name llama-3.1-instruct \
  --model-engine llama.cpp \
  --size-in-billions 8 \
  --quantization q4_k_m

# Qwen 2.5 7B Instruct (GPU)
xinference launch \
  --model-name qwen2.5-instruct \
  --model-engine transformers \
  --size-in-billions 7

# ChatGLM4 9B (GPU)
xinference launch \
  --model-name chatglm4 \
  --model-engine transformers \
  --size-in-billions 9

# Mistral 7B Instruct v0.3 (GPU)
xinference launch \
  --model-name mistral-instruct-v0.3 \
  --model-engine transformers \
  --size-in-billions 7

# Mixtral 8x7B Instruct (vLLM backend, needs 26 GB of VRAM or more)
xinference launch \
  --model-name mixtral-instruct-v0.1 \
  --model-engine vllm \
  --size-in-billions 46
```
How do I list every model Xinference supports?

Run `xinference registrations --model-type LLM` to see all registered LLM families, or browse the model library in the web UI at http://localhost:9997.

Can I run two models at the same time in Xinference?

Yes. Run `xinference launch` twice with different model names. Each model gets its own UID, which selects it on the shared API endpoint. Your total VRAM budget must be large enough to hold both models at once.

Using the OpenAI-compatible API

Xinference's API is a drop-in replacement for the OpenAI API. Point any OpenAI client at `http://localhost:9997/v1`, set `api_key` to any non-empty string, and pass the model UID returned by `xinference launch` as the `model` parameter. In the examples here the UID matches the model name. Existing LangChain, LlamaIndex, or custom OpenAI client code works without further changes.

Set base_url to http://localhost:9997/v1 and use the model name as the model ID to connect any OpenAI-compatible client to Xinference.

An OpenAI-compatible API means your application logic does not change. The same Python code that calls GPT-4 can call Llama 3 through Xinference: you only swap the base URL and the model name.

```python
from openai import OpenAI

client = OpenAI(
    api_key="not-required",   # Xinference accepts any non-empty string
    base_url="http://localhost:9997/v1"
)

# Chat completion: works the same for Llama 3, Qwen, ChatGLM, and Mistral
response = client.chat.completions.create(
    model="llama-3.1-instruct",   # the model name doubles as the UID here
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarise the GDPR in 3 bullet points."}
    ]
)
print(response.choices[0].message.content)

# Embedding model (launch bge-base-en-v1.5 separately with xinference launch first)
embedding = client.embeddings.create(
    model="bge-base-en-v1.5",
    input="Local LLMs preserve data privacy."
)
print(embedding.data[0].embedding[:5])
```

Does Xinference support streaming responses?

Yes. Set `stream=True` in the `chat.completions.create` call. Xinference streams tokens in real time on every supported backend.
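
A minimal sketch of consuming such a stream follows. It assumes the chunks follow the OpenAI streaming schema, where each chunk carries `choices[0].delta.content`. `stream_chat` is a hypothetical helper that takes any OpenAI-style client, so it can be exercised without a server.

```python
def stream_chat(client, model_uid, prompt):
    """Yield text fragments from a streaming chat completion as they arrive."""
    stream = client.chat.completions.create(
        model=model_uid,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices (e.g. usage-only chunks)
        delta = chunk.choices[0].delta.content
        if delta:  # skip role-only or empty deltas
            yield delta


# Usage with a running server (requires `pip install openai`):
#   from openai import OpenAI
#   client = OpenAI(api_key="not-required", base_url="http://localhost:9997/v1")
#   for piece in stream_chat(client, "llama-3.1-instruct", "Hello"):
#       print(piece, end="", flush=True)
```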

Can I use LangChain with Xinference?

Yes. Use `ChatOpenAI(base_url="http://localhost:9997/v1", api_key="x", model="llama-3.1-instruct")` from `langchain_openai`. No separate Xinference-specific library is required.

Xinference vs Ollama vs vLLM: how to choose

The three most common local inference frameworks each target a different user. Choose based on your primary constraint.

Choose Xinference when you need to serve several model types at once (LLM + embedding + reranker) or need native ChatGLM support; choose Ollama when you want single-user desktop convenience.

| Criterion | Xinference | Ollama | vLLM |
| Best for | Teams, multi-model, embedding + LLM | Single-user desktop, Modelfile workflow | High-throughput GPU serving |
| GPU required | No (llama.cpp backend) | No (CPU mode supported) | Yes (CUDA/ROCm) |
| Model switching | Multiple models run concurrently | One model at a time (swapped) | One model per server instance |
| Embedding support | Yes (BGE, E5, and others) | Yes (limited) | No (separate embedding server needed) |
| Web UI | Built in at localhost:9997 | None (use Open WebUI) | None |
| ChatGLM support | Native (chatglm4) | Limited | Limited |

Is Xinference harder to set up than Ollama?

Slightly. Ollama installs as a single binary download, whereas Xinference requires Python and pip. Both tools are ready in under five minutes, though, and once running Xinference offers a richer multi-model environment.

Can Xinference replace vLLM?

For single-machine serving, yes. Xinference uses vLLM as a backend (`--model-engine vllm`) and adds a web UI and model registry on top. If you need maximum raw throughput across multiple GPU nodes, a dedicated vLLM deployment is still faster.

Frequently asked questions

What is Xinference?

Xinference (Xorbits Inference) is an open-source model serving framework that runs Llama 3, Qwen, ChatGLM, Mistral, and 30+ other families locally behind an OpenAI-compatible API. It supports GPU, CPU (via llama.cpp), and multi-GPU cluster deployments.

Which models does Xinference support in 2026?

Xinference registers more than 200 model configurations. The most popular in 2026 are Llama 3.1 8B/70B Instruct, Qwen 2.5 7B/72B Instruct, ChatGLM4 9B, Mistral 7B Instruct v0.3, and Mixtral 8x7B Instruct. Run `xinference registrations --model-type LLM` for the full list.

How does Xinference download model weights?

The first time you run `xinference launch` for a model, Xinference downloads the weights from Hugging Face or ModelScope (configurable). The weights are cached locally, so later launches skip the download and start right away. The `XINFERENCE_HOME` environment variable controls where the cache lives.
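
One way to apply these settings from a script is to prepare the environment before starting the server. In the sketch below, `XINFERENCE_HOME` comes from the text above. `XINFERENCE_MODEL_SRC` is, to my understanding, the variable that selects the download source (for example `modelscope`), but treat it as an assumption and check the Xinference documentation. `xinference_env` is a hypothetical helper.

```python
import os


def xinference_env(home=None, model_src=None, base_env=None):
    """Return a copy of the environment with Xinference settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    if home is not None:
        # Directory where Xinference keeps its cache and other state.
        env["XINFERENCE_HOME"] = os.path.expanduser(str(home))
    if model_src is not None:
        # Assumed variable name for the download source, e.g. "modelscope".
        env["XINFERENCE_MODEL_SRC"] = model_src
    return env


env = xinference_env(home="/data/xinference", model_src="modelscope", base_env={})
print(env)
```

Starting the server with `subprocess.run(["xinference-local"], env=xinference_env(home="/data/xinference"))` would then place downloads under that directory.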

Does Xinference run on Windows?

Yes. It installs with pip on Python 3.9 or later. The llama.cpp backend works on Windows CPUs with no extra dependencies. For GPU support on Windows, install CUDA 12.x and the matching PyTorch wheel before installing Xinference.

Can I use Xinference for RAG?

Yes. Launch a BGE or E5 embedding model alongside your LLM (`xinference launch --model-name bge-base-en-v1.5 --model-type embedding`). Both models share the same API endpoint, so a RAG pipeline calls the embedding endpoint for indexing and the chat endpoint for generation.
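
The indexing and generation split can be sketched as follows. This is a toy in-memory retriever under stated assumptions, not a production pipeline. `embed` is any function mapping text to a vector, for instance a thin wrapper around the embeddings endpoint. All helper names are hypothetical, and a real system would normally use a vector store instead of a Python list.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_index(docs, embed):
    """Indexing step: embed every document once."""
    return [(doc, embed(doc)) for doc in docs]


def retrieve(query, index, embed, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_messages(query, context_docs):
    """Generation step: put the retrieved context into the chat prompt."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return [
        {"role": "system", "content": "Answer using only this context:\n" + context},
        {"role": "user", "content": query},
    ]


# With a running server and the `openai` package, `embed` could be:
#   embed = lambda t: client.embeddings.create(
#       model="bge-base-en-v1.5", input=t).data[0].embedding
# and build_messages(...) would be passed to client.chat.completions.create.
```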

A Note on Third-Party Facts

This article references third-party AI models, benchmarks, prices, and licenses. The AI landscape changes rapidly. Benchmark scores, license terms, model names, and API prices can shift between the time of writing and the time you read this. Before making deployment or compliance decisions based on this article, verify current figures on each provider's official source: Hugging Face model cards for licenses and benchmarks, provider websites for API pricing, and EUR-Lex for current GDPR and EU AI Act text. This article reflects publicly available information as of May 2026.
