PromptQuorum
Tools & Interfaces

LM Studio Advanced Features in 2026: GPU Settings, LoRA, and Fine-Tuning

9 min read · By Hans Kuepper · Founder of PromptQuorum, a multi-model AI dispatch tool

LM Studio is primarily a chat app, but it also includes advanced features for developers: GPU memory configuration, context window adjustment, OpenAI-compatible API, and integration with fine-tuning tools. As of April 2026, LM Studio is expanding beyond chat to support professional workflows like LoRA fine-tuning and batch inference.

Key Takeaways

  • LM Studio has advanced settings in the Settings → Server tab (GPU options, context length).
  • GPU memory can be manually set from 10% to 100% of VRAM — lower values free up GPU for other apps.
  • Context window (number of tokens the model can see) can be extended up to model limits, but it uses more VRAM.
  • Local API (beta) exposes OpenAI-compatible endpoints at localhost:1234 for integration.
  • As of April 2026, LoRA fine-tuning is not yet built into LM Studio; use Text-Generation-WebUI or training scripts instead.

How Do You Configure GPU Memory in LM Studio?

LM Studio lets you control how much GPU VRAM the model uses:

  • 1. Click Settings (bottom-left gear icon).
  • 2. Find GPU acceleration slider (default: 100%).
  • 3. Slide to 50% if you want the GPU to use 50% of VRAM, freeing up the rest for other applications.
  • 4. Lower GPU allocation = slower inference speed, but more headroom for simultaneous apps.
  • 5. Click Restart to apply changes.
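
To reason about where to set the slider, a back-of-envelope sketch can help. The functions and the 24 GB / 4 GB figures below are illustrative assumptions, not LM Studio internals:

```python
# Back-of-envelope VRAM budget for a given GPU-allocation slider setting.
# Illustrative only — LM Studio's actual memory accounting is more nuanced.

def model_vram_budget_gb(total_vram_gb: float, gpu_allocation_pct: int) -> float:
    """VRAM (GB) the model may occupy at a given slider percentage."""
    return total_vram_gb * gpu_allocation_pct / 100

def fits_on_gpu(model_size_gb: float, total_vram_gb: float,
                gpu_allocation_pct: int) -> bool:
    """True if the quantized model file fits entirely in the allowed VRAM slice."""
    return model_size_gb <= model_vram_budget_gb(total_vram_gb, gpu_allocation_pct)

# A ~4 GB quantized model on a hypothetical 24 GB card:
print(model_vram_budget_gb(24, 50))  # 12.0 GB available at the 50% setting
print(fits_on_gpu(4.0, 24, 50))      # True — fully GPU-resident
print(fits_on_gpu(4.0, 24, 10))      # False — 2.4 GB budget forces CPU offload
```

Whenever `fits_on_gpu` is False, expect the slowdown described in step 4: the layers that do not fit run on the CPU.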

How Do You Extend Context Window?

The context window is the maximum number of tokens the model can attend to at once. Extending it allows longer conversations and documents but uses more VRAM.

  • 1. Open Settings → Server.
  • 2. Look for Context length (default: model's built-in limit).
  • 3. Increase to 4k, 8k, 16k, or 32k (depending on model support).
  • 4. Each doubling of context length roughly doubles VRAM usage.
  • 5. Test your extended context by starting a chat and providing long prompts.
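
The VRAM growth in step 4 comes mostly from the KV cache, which scales linearly with context length. A rough estimator — the layer and head counts are approximate Llama-3.2-3B values, used here only as an illustration:

```python
# Rough KV-cache size estimate: the main reason longer context costs more VRAM.
# Shapes are illustrative (approximate Llama-3.2-3B values), not read from LM Studio.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_value: int = 2) -> int:
    """Combined size of the K and V caches at fp16 (2 bytes per value)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_value

gb = 1024 ** 3
at_4k = kv_cache_bytes(28, 8, 128, 4096)
at_8k = kv_cache_bytes(28, 8, 128, 8192)
print(f"4k context: {at_4k / gb:.2f} GB, 8k context: {at_8k / gb:.2f} GB")
print(at_8k == 2 * at_4k)  # True — doubling context doubles the KV cache
```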

How Do You Enable LM Studio's Local API (Beta)?

LM Studio's local API (beta as of April 2026) mimics OpenAI's API:

```python
# 1. Open LM Studio Settings → Server
# 2. Turn on "Enable local API server"
# 3. The API runs at http://localhost:1234/v1

# 4. Use it like any other OpenAI-compatible endpoint (same pattern as Ollama):
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="not-needed"  # LM Studio ignores the key, but the client requires one
)
response = client.chat.completions.create(
    model="llama-3.2-3b-gguf",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
```
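
The server also supports streaming responses through the client's standard `stream=True` flag. A hedged sketch — the `collect_stream` helper is our own, and the stand-in chunk objects below only mimic the shape of the client's streamed chunks:

```python
from types import SimpleNamespace

def collect_stream(chunks) -> str:
    """Join the incremental delta.content pieces of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk's delta.content is typically None
            parts.append(delta)
    return "".join(parts)

# Against a live server you would pass the iterator returned by
#   client.chat.completions.create(model=..., messages=..., stream=True)

# Offline demonstration with stand-in chunk objects shaped like the client's:
fake = [SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=c))])
        for c in ["Hel", "lo", None]]
print(collect_stream(fake))  # Hello
```

Streaming matters for chat UIs: tokens appear as they are generated instead of after the whole completion finishes.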

Can You Fine-Tune Models With LM Studio?

As of April 2026, LM Studio does not have built-in LoRA fine-tuning. For fine-tuning, use:

- Text-Generation-WebUI (easiest for LoRA)

- LLaMA-Factory (advanced, production-grade)

- unsloth (fastest, optimal for VRAM usage)

LM Studio is suitable for applying pre-trained LoRA adapters but not for training new ones. Future versions may add LoRA training directly.

How Do You Run Batch Inference in LM Studio?

Batch inference means processing multiple prompts without waiting for responses between them. LM Studio does not have a built-in batch mode, but you can simulate it via the API or Python loop:

```python
# Python: batch inference via the LM Studio API
from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:1234/v1", api_key="x")

prompts = [
    "What is 2+2?",
    "Explain quantum computing",
    "How do transformers work?"
]

results = []
for prompt in prompts:
    response = client.chat.completions.create(
        model="llama-3.2-3b-gguf",
        messages=[{"role": "user", "content": prompt}]
    )
    results.append({
        "prompt": prompt,
        "response": response.choices[0].message.content
    })

with open("batch_results.json", "w") as f:
    json.dump(results, f, indent=2)
```
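
The loop above sends requests strictly one at a time. To overlap the network round-trips, a thread pool works against the same endpoint (actual speed-up depends on how the server queues concurrent requests). A sketch with a pluggable `ask` function so it also runs without a live server:

```python
from concurrent.futures import ThreadPoolExecutor

def batch_infer(prompts, ask, max_workers: int = 4):
    """Run ask(prompt) for each prompt concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        responses = list(pool.map(ask, prompts))
    return [{"prompt": p, "response": r} for p, r in zip(prompts, responses)]

# Against the live server, ask would wrap the client call:
#   ask = lambda p: client.chat.completions.create(
#       model="llama-3.2-3b-gguf",
#       messages=[{"role": "user", "content": p}],
#   ).choices[0].message.content

# Offline demonstration with a stand-in:
results = batch_infer(["What is 2+2?", "Define LoRA"], ask=lambda p: f"echo: {p}")
print(results[0])  # {'prompt': 'What is 2+2?', 'response': 'echo: What is 2+2?'}
```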

How Do You Benchmark Model Speed in LM Studio?

LM Studio includes a built-in benchmark tool:

  • 1. Load a model in LM Studio.
  • 2. Click Settings → Benchmark tab.
  • 3. Click Run benchmark — it measures tokens/second for your specific hardware.
  • 4. Results show baseline performance without chat overhead.
  • This helps you understand expected speed before deploying to production.
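
If you would rather measure throughput through the API than the GUI tool, a minimal timer sketch. `tokens_per_second` is our own helper; with a live server, the callable would return `response.usage.completion_tokens` (part of the standard OpenAI response shape):

```python
import time

def tokens_per_second(generate) -> float:
    """Time one call to generate(), which must return the completion-token count."""
    start = time.perf_counter()
    n_tokens = generate()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Against the live server:
#   tps = tokens_per_second(lambda: client.chat.completions.create(
#       model="llama-3.2-3b-gguf",
#       messages=[{"role": "user", "content": "Write 100 words about GPUs."}],
#   ).usage.completion_tokens)

# Offline sanity check with a stub that "generates" 50 tokens in ~100 ms:
def fake_generate():
    time.sleep(0.1)  # pretend decoding takes 100 ms
    return 50        # pretend 50 completion tokens

tps = tokens_per_second(fake_generate)
print(f"{tps:.0f} tokens/s")
```

Note that this measures end-to-end latency, including prompt processing, so it will read slightly lower than the benchmark tab's pure generation speed.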

Common Mistakes With LM Studio Advanced Features

  • Lowering GPU allocation too much and blaming slowness on the model. If you set GPU to 10%, inference will be 5–10× slower because it is running mostly on CPU. Test with 80%+ GPU allocation first.
  • Extending context window beyond model support. Models have maximum supported context lengths. Extending beyond that does not add capability; it just wastes VRAM.
  • Expecting LoRA training in LM Studio. As of April 2026, it is not available. Use Text-Generation-WebUI or training libraries.
  • Forgetting that API needs explicit enable. The local API is off by default. Enable it in Settings → Server.

Common Questions About LM Studio Advanced Features

What is the difference between LM Studio API and Ollama API?

Both expose OpenAI-compatible endpoints: LM Studio's API listens on localhost:1234, Ollama's on localhost:11434. The same client code works against either once you change the base URL. Choose whichever tool you prefer for managing and chatting with models.

Can I use the LM Studio API in production?

It works, but Ollama API is more mature. LM Studio API is in beta. For production, Ollama is the safer choice.

Does lowering GPU allocation reduce VRAM requirements?

Yes. Lowering GPU allocation to 50% roughly halves VRAM usage, but inference is 2–5× slower because the model runs partially on CPU.

Sources

  • LM Studio Documentation — lmstudio.ai/docs
  • LM Studio Local Server (Beta) — lmstudio.ai/docs/local-server/overview
  • OpenAI API Compatibility — platform.openai.com/docs/api-reference
