Key points
- Disable logging/debug output (easy): ~10% speedup.
- Use Q4 quantization (easy): same speed, less VRAM.
- Tune batch size (intermediate): 2–3× speedup with batched processing.
- Switch from Ollama to vLLM (intermediate): 2–5× speedup under concurrent requests.
- GPU memory utilization 90%+ (intermediate): 15–20% speedup.
- Combine all techniques: ~2–3× overall speedup.
How does GPU memory utilization affect speed?
Many tools default to using only 70–80% of GPU VRAM, leaving the rest idle. Raising this to 90–95% lets the engine pre-allocate more KV cache, improving speed by 15–20%:
```bash
# vLLM: increase GPU memory utilization
vllm serve meta-llama/Llama-2-7b-hf \
  --gpu-memory-utilization 0.95

# Ollama: environment variable
export OLLAMA_GPU_THRESHOLD=0.95  # use 95% of GPU
ollama run llama3.2:3b

# LM Studio: Settings → GPU acceleration slider (move to 100%)
```
What batch size maximizes throughput?
With batched processing (multiple prompts handled at once), raising the batch size from 1 to 32 yields a 2–4× throughput improvement.
A single request leaves the pipeline underutilized; 32 concurrent requests deliver 2–4× the throughput.
Trade-off: per-request latency increases (each request waits for the batch to complete).
| Batch size | Throughput | Latency per request | Use case |
|---|---|---|---|
| 1 (single) | 50 tok/sec | Minimal | Real-time chat |
| 8 | 120 tok/sec | Acceptable | Light concurrency |
| 32 | 200 tok/sec | High | Batch APIs |
| 64+ | 250+ tok/sec | Very high | Offline batch jobs |
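The throughput/latency trade-off in the table can be made concrete with a little arithmetic: dividing total throughput by batch size gives the effective tokens/sec each individual request sees. A minimal sketch using the table's illustrative numbers (the per-request figures are derived, not measured):

```python
# Batch size -> total tok/sec, taken from the table above.
table = {1: 50, 8: 120, 32: 200, 64: 250}

for batch, total in table.items():
    # Each request in the batch effectively shares the total throughput,
    # which is why per-request latency climbs as batch size grows.
    per_request = total / batch
    print(f"batch={batch:>2}: {total} tok/sec total, "
          f"{per_request:.2f} tok/sec per request")
```

Total throughput rises 5× from batch 1 to 64, while each request's effective generation rate falls from 50 to under 4 tok/sec, which is exactly why large batches suit offline jobs rather than interactive chat.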
Which inference engine is fastest: vLLM vs. Ollama vs. llama.cpp?
vLLM: 5–10× faster than Ollama under concurrent requests → best for production APIs serving multiple users.
llama.cpp: fastest for single requests on consumer hardware → best for personal local setups.
Ollama: best developer experience for a single user; on par with llama.cpp for single requests.
Text-Generation-WebUI: slowest but feature-rich → for experimentation only, not suited to production.
Does quantization speed up inference?
On modern GPUs (RTX 40 series), Q4 and Q5 run at roughly the same speed as FP16 → quantize to save VRAM, not to gain speed.
Indirect speed benefits of quantization:
- Smaller model file = faster startup load from disk
- Reduced memory bandwidth = slightly faster (10–15%) on memory-constrained hardware
Quantization is mainly about VRAM savings, not raw token throughput.
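As a back-of-envelope illustration of the VRAM argument, here is a rough estimate of weight storage for a 7B model at FP16 versus Q4. The ~4.5 bits/weight figure for Q4_K_M is an approximation, and real footprints also include the KV cache, activations, and runtime overhead:

```python
# Back-of-envelope VRAM estimate for model weights at different precisions.
# Illustrative only; actual memory use is higher (KV cache, activations).

def model_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given precision."""
    return n_params * bits_per_weight / 8 / 1024**3

n = 7e9                                # 7B parameters
fp16 = model_weight_gb(n, 16)          # full half-precision weights
q4 = model_weight_gb(n, 4.5)           # Q4_K_M averages roughly 4.5 bits/weight
print(f"FP16: {fp16:.1f} GiB, Q4_K_M: {q4:.1f} GiB, "
      f"saved: {fp16 - q4:.1f} GiB")   # ~13.0 GiB vs ~3.7 GiB
```

The roughly 9 GiB saved is what makes a 7B model fit comfortably on a 12 GB consumer card; the arithmetic says nothing about tok/sec, matching the point above.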
What speed improvements are realistic?
Example: optimizing a 7B model on an RTX 4090, step by step:
| Change | Speed | Cumulative gain |
|---|---|---|
| Ollama defaults (baseline) | 120 tok/sec | baseline |
| Disable debug logging | 132 tok/sec | +10% |
| GPU memory → 95% | 150 tok/sec | +25% total |
| Switch to vLLM (batched) | 300 tok/sec | +2.5× |
| All optimizations combined | 300 tok/sec | +2.5× throughput |
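A quick sanity check of the table's arithmetic, using its own numbers:

```python
# Verify the cumulative-gain column against the baseline.
baseline = 120.0  # tok/sec, Ollama defaults

steps = {
    "disable debug logging": 132.0,
    "GPU memory -> 95%": 150.0,
    "switch to vLLM (batched)": 300.0,
}

for name, speed in steps.items():
    print(f"{name}: {speed:.0f} tok/sec = "
          f"{speed / baseline:.2f}x baseline")  # 1.10x, 1.25x, 2.50x
```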
Common speed-optimization mistakes
- Setting GPU memory to 100%. Out-of-memory crashes become a real risk; the safe maximum is 90–95%.
- Raising batch size for speed. Batch size increases individual request latency; it only improves throughput.
- Over-quantizing for speed. Q4 runs at roughly the same speed as FP16. Quantize for VRAM, not for speed.
- Switching inference engines mid-deployment. Bouncing between Ollama → vLLM → llama.cpp causes bugs. Pick one and optimize it.
Frequently asked questions
What is the most effective way to speed up local LLM inference?
Switching from Ollama to vLLM for concurrent requests gives the biggest gain: 5–10× throughput with batching. For single requests, raising GPU memory utilization from 70% to 90–95% gives 15–20%, and disabling debug logging adds another 10%.
Does batching improve single-request latency?
No. Batch size affects throughput (total tokens/sec across all requests) but does not help individual request latency. To reduce latency, optimize GPU memory utilization or use a faster engine (vLLM or llama.cpp).
How much faster is vLLM than Ollama?
For single requests the two are comparable (~120–150 tok/sec each for a 7B model on an RTX 4090). Under concurrent requests, vLLM is 5–10× faster thanks to continuous batching and PagedAttention.
Does quantization speed up inference?
The main benefit of quantization is VRAM savings, not speed. On modern NVIDIA GPUs (RTX 40 series), Q4 and Q5 run at about the same speed as FP16. Indirect benefit: smaller Q4 models load from disk faster.
What GPU memory utilization should I set for maximum speed?
Set 90–95% in vLLM (--gpu-memory-utilization 0.92). This lets the engine pre-allocate more memory for the KV cache. Avoid 100%: it causes OOM crashes.
Why is a local LLM slow on the first prompt?
The first prompt loads the model into VRAM (cold start), which can take 10–30 seconds. Keep the server running between sessions. With Ollama, set OLLAMA_KEEP_ALIVE=24h to prevent the model from being unloaded after inactivity.
Can CPU-only inference be meaningfully accelerated?
Limited improvement is possible: with llama.cpp, use the -t flag set to the number of physical cores (not logical cores), enable the AVX2/AVX-512 instruction sets, and use Q4_K_M quantization. Realistic ceiling: 8–12 tok/sec on a recent i9. For interactive chat, GPU hardware is the only real option.
How does context length affect inference speed?
Attention cost during prompt processing grows quadratically with context length, and per-token generation also slows as the KV cache grows, so long context windows slow inference; in practice a prompt with 4K of context can run ~4× slower than one with 1K. Keep system prompts under 500 tokens.
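The quadratic prefill cost can be sketched with a rough FLOP count for the attention scores; the hidden size and layer count below are hypothetical placeholders chosen only to show the scaling, not any specific model's dimensions:

```python
# Rough attention-score cost for prefill: each layer forms an n x n score
# matrix over d-dimensional vectors, so the cost grows as n squared.
# d and layers are illustrative placeholders.

def prefill_attention_flops(n: int, d: int = 4096, layers: int = 32) -> float:
    return layers * n * n * d

ratio = prefill_attention_flops(4096) / prefill_attention_flops(1024)
print(f"4K context costs {ratio:.0f}x the attention FLOPs of 1K")  # 16x
```

Note the raw quadratic factor (16×) applies to prompt processing; per-token generation reuses the KV cache and degrades more gently, which is why end-to-end slowdowns like the ~4× figure above come in below the quadratic bound.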
What is PagedAttention, and why does it make vLLM fast?
PagedAttention is vLLM's KV cache management system. Instead of pre-allocating a fixed memory block per request, it pages memory dynamically, much like an OS's virtual memory. This eliminates VRAM fragmentation and raises GPU utilization from ~55% to 90%+.
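The paging idea can be illustrated with a toy allocator comparison. The block size and request lengths below are made-up numbers for illustration, not vLLM internals:

```python
# Toy illustration of why paging the KV cache reduces waste. A naive server
# reserves max_len slots per request up front; a paged one holds only enough
# fixed-size blocks for the tokens each request actually has so far.

BLOCK = 16        # tokens per KV-cache block (small fixed blocks, as in paging)
MAX_LEN = 2048    # slots reserved per request in the naive scheme

def preallocated_slots(requests: list[int]) -> int:
    return len(requests) * MAX_LEN

def paged_slots(requests: list[int]) -> int:
    # ceil(tokens / BLOCK) blocks per request, each holding BLOCK slots
    return sum(-(-tokens // BLOCK) * BLOCK for tokens in requests)

active = [130, 700, 45, 1500, 260]   # current token counts of 5 requests
naive = preallocated_slots(active)
paged = paged_slots(active)
print(f"naive: {naive} slots, paged: {paged} slots, "
      f"utilization gain: {naive / paged:.1f}x")
```

Because most requests finish far short of the maximum context, the paged scheme frees the unreserved slack to serve more concurrent requests, which is where the utilization jump comes from.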
Is there a speed difference between the GGUF and safetensors model formats?
Yes. GGUF (used by llama.cpp and Ollama) is optimized for CPU and consumer-GPU inference with built-in quantization. Safetensors (used by vLLM and HuggingFace) is faster for full-precision GPU inference. Running FP16 on an RTX 40-series card, safetensors + vLLM typically beats GGUF + Ollama by 10–20%.
Sources
- vLLM optimization guide: docs.vllm.ai/en/dev_guide/performance_tuning.html
- Ollama performance tips: github.com/ollama/ollama/blob/main/docs/troubleshooting.md