Key Takeaways
- Ollama exposes an OpenAI-compatible REST API at `http://localhost:11434/v1` that accepts the same requests and returns the same response format as OpenAI's API.
- Use the OpenAI Python library unchanged: set `base_url="http://localhost:11434/v1"` and pass any placeholder `api_key` (e.g. `"ollama"`); Ollama ignores the key's value.
- The same approach works in Node.js: use the OpenAI SDK and point it at `http://localhost:11434/v1`.
- The OpenAI-compatible API behaves the same across Ollama, vLLM, and LM Studio, so switching providers usually requires changing only the base URL and model name.
- As of April 2026, both streaming (token-by-token responses) and function calling work with local models through this API.
What Does OpenAI-Compatible Mean?
OpenAI-compatible means the API endpoint returns responses in the same format as OpenAI's API. This allows any library or tool built for OpenAI to work with local models by pointing to a different URL.
Example: The OpenAI Python library sends requests like this:
```
POST /chat/completions
{
  "model": "gpt-4o",
  "messages": [...],
  "temperature": 0.7
}
```
Ollama's API accepts the exact same request at `localhost:11434/v1/chat/completions` and returns the response in OpenAI's format:
```
{
  "choices": [{"message": {"content": "..."}}],
  "usage": {"prompt_tokens": 10, "completion_tokens": 20}
}
```
Because the format is identical, you do not need to learn a new API or rewrite your code.
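To make that portability concrete, the only provider-specific detail is the base URL. A small sketch (the ports below are common defaults and are assumptions; verify them for your setup):

```python
# Provider-switching sketch: same client code, different base URL.
# Ports are common defaults (assumptions); check your own configuration.
PROVIDERS = {
    "ollama":    "http://localhost:11434/v1",
    "vllm":      "http://localhost:8000/v1",
    "lm_studio": "http://localhost:1234/v1",
}

def base_url(provider: str) -> str:
    """Return the OpenAI-compatible base URL for a local provider."""
    return PROVIDERS[provider]

print(base_url("ollama"))  # -> http://localhost:11434/v1
```

Everything else in the request stays the same, so switching providers is a configuration change, not a rewrite.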
What Is Ollama's API Endpoint?
When you run `ollama serve`, Ollama starts a REST API at `http://localhost:11434`. The OpenAI-compatible endpoints are:
- Chat completions: `POST http://localhost:11434/v1/chat/completions` — matches `/chat/completions` from OpenAI.
- Text completions: `POST http://localhost:11434/v1/completions` — matches `/completions` from OpenAI.
- Embeddings: `POST http://localhost:11434/v1/embeddings` — convert text to vectors.
- List models: `GET http://localhost:11434/v1/models` — list available models.
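If you want to see the wire format without any SDK, the same request can be built with the Python standard library alone. A sketch (the actual network call is commented out because it needs a running Ollama server with the model pulled):

```python
import json
from urllib import request

# Build the same request the OpenAI library would send to the chat
# completions endpoint. The model name assumes `ollama pull llama3.2:3b`.
payload = {
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
}
req = request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment to send (requires `ollama serve` running locally):
# with request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```

This is exactly the request shown earlier; the OpenAI library just builds and sends it for you.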
How to Use Ollama API With Python (OpenAI Library)
Install the OpenAI library and point it to localhost:
```
# 1. Install the OpenAI library
pip install openai

# 2. Connect to Ollama
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # dummy key; Ollama ignores it
)

# 3. Make a request
response = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[
        {"role": "user", "content": "What is 2+2?"}
    ]
)
print(response.choices[0].message.content)
```
How to Use Ollama API With Node.js
Install the OpenAI SDK and connect:
```
// 1. Install
npm install openai

// 2. Connect to Ollama
const OpenAI = require("openai");

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama"
});

// 3. Make a request (inside an async function; CommonJS has no top-level await)
async function main() {
  const response = await client.chat.completions.create({
    model: "llama3.2:3b",
    messages: [{ role: "user", content: "What is 2+2?" }]
  });
  console.log(response.choices[0].message.content);
}

main();
```
How to Use Ollama API From JavaScript in the Browser
Browsers enforce CORS: a page can call `http://localhost:11434` only if Ollama allows that page's origin. By default, Ollama accepts requests from localhost origins, so browser-side JavaScript works when the page itself is served from localhost.
To call Ollama from a page served on a different host, either configure Ollama to allow that origin (via the `OLLAMA_ORIGINS` environment variable) or route the request through a server-side proxy.
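One way to build such a proxy with only the Python standard library. This is a sketch: the port, the wildcard origin, and the lack of error handling are all simplifications you would tighten in practice:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request

# Minimal CORS proxy sketch: the browser calls this server (port 8080 is
# an illustrative choice), which forwards to Ollama and adds CORS headers.
OLLAMA = "http://localhost:11434"

class CORSProxy(BaseHTTPRequestHandler):
    def _cors(self):
        # "*" allows any origin; restrict this to your app's origin in practice.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Access-Control-Allow-Headers", "Content-Type")

    def do_OPTIONS(self):
        # Answer the browser's CORS preflight request.
        self.send_response(204)
        self._cors()
        self.end_headers()

    def do_POST(self):
        # Forward the request body to Ollama unchanged.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        upstream = request.Request(
            OLLAMA + self.path,
            data=body,
            headers={"Content-Type": "application/json"},
        )
        with request.urlopen(upstream) as resp:
            data = resp.read()
        self.send_response(200)
        self._cors()
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(data)

# To run: HTTPServer(("localhost", 8080), CORSProxy).serve_forever()
```

The browser then posts to `http://localhost:8080/v1/chat/completions` instead of talking to Ollama directly.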
```
// Browser-side JavaScript (page served from localhost, Ollama on localhost:11434)
fetch("http://localhost:11434/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama3.2:3b",
    messages: [{ role: "user", content: "What is 2+2?" }]
  })
})
  .then(res => res.json())
  .then(data => console.log(data.choices[0].message.content));
```
How Do You Stream Responses Token-by-Token?
Streaming lets you display responses as they are generated, token by token, instead of waiting for the entire response.
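On the wire, a streamed response is a sequence of server-sent events whose `data:` payload is a JSON delta. The OpenAI library parses these for you, but a sketch over inlined sample lines (illustrative, not captured from a real response) shows what it is doing:

```python
import json

# Sample SSE lines as Ollama would stream them (illustrative content).
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]

text = ""
for line in sample:
    payload = line[len("data: "):]
    if payload == "[DONE]":       # sentinel marking the end of the stream
        break
    delta = json.loads(payload)["choices"][0]["delta"]
    text += delta.get("content", "")

print(text)  # -> Hello
```

In practice you let the library do this, as in the example below.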
```
# Python: streaming example
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

stream = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Count to 10"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
Can Your Local Model Call Functions?
Yes, as of April 2026, function calling works with local models via the OpenAI API. You define a function schema, and the model can respond with arguments to pass to your function.
Function calling support depends on the model. Llama 3.1 8B, Qwen2.5, and most recent mid-size models support it; small models (around 3B) may not use it reliably.
```
# Example: local model calls a weather function
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            }
        }
    }
}]

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "What is the weather in SF?"}],
    tools=tools
)

# Check if the model returned a function call
if response.choices[0].message.tool_calls:
    call = response.choices[0].message.tool_calls[0]
    print(f"Call function: {call.function.name} with {call.function.arguments}")
```
Common Mistakes With Local LLM APIs
- Forgetting that the API key is ignored. The OpenAI client requires a non-empty `api_key`, but Ollama never checks its value, so any placeholder string such as `"ollama"` works. Access control comes from the server listening only on localhost (or your local network), not from the key.
- Not realizing the model name matters. If you call `/chat/completions` with `model="gpt-4"` but have only pulled `llama3.2:3b` in Ollama, the request will fail. Use the exact model names from `ollama list`.
- Assuming Ollama needs internet. It does not; the API is entirely local. But the OpenAI client defaults to `https://api.openai.com`, so if you forget to set `base_url`, your requests go to OpenAI's servers and fail. Always set `base_url` explicitly.
- CORS errors from browser. If you call Ollama from a browser-side script and get a CORS error, it means the browser blocked the request for security. Workaround: make the call from a server-side proxy, or ensure your app is served from localhost.
- Not setting stream=True when expecting streaming. If you want token-by-token responses, you must explicitly set `stream=True` in the request. By default, it waits for the full response.
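To catch the wrong-model-name mistake early, you can validate a requested name against what is actually installed. A sketch, with the installed set hard-coded for illustration (real code would populate it from `GET /v1/models` or `ollama list`):

```python
# Fail fast if the requested model is not installed locally.
# `installed` is hard-coded here as an illustration; in real code,
# fetch it from GET http://localhost:11434/v1/models.
installed = {"llama3.2:3b", "qwen2.5:7b"}

def check_model(name: str) -> str:
    if name not in installed:
        raise ValueError(f"model {name!r} not found; run: ollama pull {name}")
    return name

check_model("llama3.2:3b")   # fine: model is installed
# check_model("gpt-4")       # raises ValueError: not a local model
```

A check like this turns a confusing 404 from the API into an immediate, actionable error.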
Common Questions About Local LLM APIs
Do I need to modify my OpenAI code to use Ollama?
No. Set `base_url="http://localhost:11434/v1"` and `api_key="ollama"`. Everything else stays the same. If you have code using the OpenAI library, swap these two lines and it works with your local model.
Can I use the API from a different computer on my network?
Yes. By default, Ollama listens on localhost only. To allow network access, set the environment variable `OLLAMA_HOST=0.0.0.0:11434` before running Ollama. Then point your code to `http://<machine-ip>:11434/v1`. Be careful with security — use a firewall if this is production.
Does LM Studio have an OpenAI-compatible API?
Yes, as of April 2026, LM Studio has an OpenAI-compatible API in beta at `http://localhost:1234/v1`. Use the same code as Ollama, just change the port.
Can I call multiple models simultaneously?
Yes, if you have pulled them in Ollama. Note that each loaded model occupies its own VRAM, so running two models at once roughly doubles memory use; you need enough GPU memory for both.
Is the API authenticated?
No. By default, Ollama's API has no authentication. Anyone with access to localhost:11434 can use it. For production with network access, add authentication via a reverse proxy (nginx with Basic Auth, etc.).
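As one example of that setup, an nginx sketch that puts TLS and Basic Auth in front of Ollama. The hostname, certificate paths, and timeout are placeholders to adapt:

```
# Sketch: nginx reverse proxy with Basic Auth in front of Ollama.
server {
    listen 443 ssl;
    server_name llm.example.com;                  # placeholder hostname

    ssl_certificate     /etc/ssl/certs/llm.pem;   # placeholder paths
    ssl_certificate_key /etc/ssl/private/llm.key;

    location /v1/ {
        auth_basic           "Ollama API";
        auth_basic_user_file /etc/nginx/.htpasswd;  # create with htpasswd
        proxy_pass http://127.0.0.1:11434;
        proxy_read_timeout 300s;                    # long generations take time
    }
}
```

Clients then use `base_url="https://llm.example.com/v1"` with their Basic Auth credentials, while Ollama itself stays bound to localhost.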
Sources
- Ollama API Documentation — github.com/ollama/ollama/blob/main/docs/api.md
- OpenAI Python Library — github.com/openai/openai-python
- OpenAI API Reference — platform.openai.com/docs/api-reference
- LM Studio Local API (Beta) — lmstudio.ai/docs/local-server/overview