
Ollama OpenAI API: Python & Node.js Integration in 3 Steps (Code Examples + Streaming + Tool Calling)

10 min read · By Hans Kuepper · Founder of PromptQuorum, multi-model AI dispatch tool

LM Studio (localhost:1234), Ollama (localhost:11434), and vLLM (localhost:8000) all expose REST APIs in the OpenAI format. Use the official OpenAI Python or Node.js SDK with any local model by changing two lines: set base_url to your local endpoint and api_key to any string. As of May 2026, this is the standard way to run local LLMs in production Python and Node.js applications without cloud costs or vendor lock-in.

Key Takeaways

  • Ollama exposes a REST API at `http://localhost:11434/v1` that follows the request and response format of OpenAI's API.
  • Use the OpenAI Python library: set `base_url="http://localhost:11434/v1"` and replace your OpenAI key with any placeholder such as `api_key="ollama"`.
  • Same approach in Node.js: use the OpenAI SDK and point it to localhost:11434.
  • The OpenAI-compatible API works the same way across Ollama, vLLM, and LM Studio -- switching providers means changing the base URL and model name, not the client code.
  • As of May 2026, streaming (token-by-token responses) and function calling both work with local models via this API.

Quick Facts

Ollama API: `http://localhost:11434/v1` -- follows OpenAI's `/chat/completions` format

LM Studio API: `http://localhost:1234/v1` -- same format, different port

vLLM API: `http://localhost:8000/v1` -- production-grade serving

Code change: 2 lines -- `base_url` and `api_key`. All other code stays identical.

Supported: Chat completions, text completions, embeddings, streaming, function calling

Authentication: None by default -- localhost access only. Add a reverse proxy for network access.

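Models in the code examples: Llama 4 Scout (`llama4:scout`, higher quality; it is a large mixture-of-experts model, so check that it fits your memory before pulling it) or Llama 3.2 3B (`llama3.2:3b`, lightweight)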

What Does OpenAI-Compatible Mean?

OpenAI-compatible means the API endpoint accepts requests and returns responses in the same format as OpenAI's API. This allows any library or tool built for OpenAI to work with local models by pointing to a different URL. See Ollama vs LM Studio for how the two compare in implementing this standard.

Example: The OpenAI Python library sends requests like this:

```
POST /chat/completions
{
  "model": "gpt-4o",
  "messages": [...],
  "temperature": 0.7
}
```

Ollama's API accepts the exact same request at `localhost:11434/v1/chat/completions` and returns the response in OpenAI's format:

```
{
  "choices": [{"message": {"content": "..."}}],
  "usage": {"prompt_tokens": 10, "completion_tokens": 20}
}
```

Because the format is identical, you do not need to learn a new API or rewrite your code.
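To make the shared format concrete, here is a minimal sketch that uses only the Python standard library. The helper names `build_chat_request` and `extract_reply` are illustrative and not part of any SDK; the field names come from the request and response shown above, and the sample reply text is made up.

```python
import json


def build_chat_request(model, user_text, temperature=0.7):
    """Return the JSON body for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "temperature": temperature,
    }


def extract_reply(response_body):
    """Pull the assistant text out of an OpenAI-format response."""
    return response_body["choices"][0]["message"]["content"]


# The same body can go to OpenAI or to a local server; only the URL differs.
body = json.dumps(build_chat_request("llama3.2:3b", "What is 2+2?"))

# A response shaped like the example above (content invented for illustration).
sample = {
    "choices": [{"message": {"role": "assistant", "content": "4"}}],
    "usage": {"prompt_tokens": 10, "completion_tokens": 20},
}
reply = extract_reply(sample)
```

Because both helpers only touch the shared fields, they should behave the same whichever compatible server produced the response.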

---

Did You Know: The OpenAI API format has become the de facto standard for LLM APIs. Anthropic (Claude) and Google (Gemini) offer OpenAI-compatible endpoints, and every major local inference tool (Ollama, vLLM, LM Studio, llama.cpp) supports the format. Code written against it is largely provider-agnostic, the closest thing the AI industry has to a universal API.

Switching from OpenAI to Ollama requires changing 2 lines -- base_url and api_key -- all other code stays identical.

What Is Ollama's API Endpoint?

**When you run `ollama serve`, Ollama starts a REST API at `http://localhost:11434`.** The OpenAI-compatible endpoints are:

| Endpoint | URL | Description |
|---|---|---|
| Chat Completions | `POST http://localhost:11434/v1/chat/completions` | Matches `/chat/completions` from OpenAI |
| Text Completions | `POST http://localhost:11434/v1/completions` | Matches `/completions` from OpenAI |
| Embeddings | `POST http://localhost:11434/v1/embeddings` | Convert text to vectors |
| List Models | `GET http://localhost:11434/v1/models` | List available models |

Ollama receives the OpenAI-formatted request and runs inference locally -- the response returns in identical OpenAI format, no internet required.
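As a quick way to exercise the List Models endpoint, the sketch below reads `/v1/models` with the standard library. It assumes the response follows OpenAI's list format (a `data` array whose entries have an `id`); the function names and the sample payload are illustrative.

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434/v1"  # default Ollama address


def model_ids(models_payload):
    """Extract model names from an OpenAI-format /v1/models response."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def list_local_models(base_url=OLLAMA_BASE_URL, timeout=5):
    """GET /v1/models from a running server (requires `ollama serve`)."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return model_ids(json.load(resp))


# Offline illustration with a hand-written payload in the list format.
sample_payload = {
    "object": "list",
    "data": [
        {"id": "llama3.2:3b", "object": "model"},
        {"id": "qwen3:8b", "object": "model"},
    ],
}
available = model_ids(sample_payload)
```

Calling `list_local_models()` before the first chat request is one way to confirm that the server is up and that the model name you plan to pass is actually available.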

How to Use Ollama API With Python (OpenAI Library)?

Install the OpenAI library and point it to localhost.

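Pro Tip: Set `OPENAI_BASE_URL=http://localhost:11434/v1` as an environment variable (and `OPENAI_API_KEY` to any placeholder). The official OpenAI SDKs read these variables automatically, and so do many tools built on them (LangChain, LlamaIndex, aider), though some tools expect a differently named variable, so check each tool's documentation. You can then switch between OpenAI and Ollama by changing a single env var.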

```python
# 1. Install the OpenAI library (run in a shell, not in Python):
#    pip install openai

# 2. Connect to Ollama
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # dummy key; Ollama ignores it
)

# 3. Make a request
response = client.chat.completions.create(
    model="llama4:scout",  # higher-quality MoE model; verify it fits your memory
    # model="llama3.2:3b",  # lightweight alternative for 8 GB RAM
    messages=[
        {"role": "user", "content": "What is 2+2?"}
    ],
)

print(response.choices[0].message.content)
```

How to Use Ollama API With Node.js?

Install the OpenAI SDK and connect it to your local Ollama instance.

```javascript
// 1. Install (run in a shell, not in Node.js):
//    npm install openai

// 2. Connect to Ollama
// Save as an ES module (for example app.mjs) so that top-level await works.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama", // dummy key; Ollama ignores it
});

// 3. Make a request
const response = await client.chat.completions.create({
  model: "llama4:scout",       // higher-quality MoE model; verify it fits your memory
  // model: "llama3.2:3b",     // lightweight for 8 GB RAM
  messages: [{
    role: "user",
    content: "What is 2+2?"
  }]
});

console.log(response.choices[0].message.content);
```

How to Use LM Studio OpenAI-Compatible Server (localhost:1234)

**LM Studio exposes an OpenAI-compatible API at `http://localhost:1234/v1`.** Enable it under the Local Server tab -- load a model, then click Start Server. The same Python and Node.js code works with LM Studio -- change only the port from 11434 to 1234.

LM Studio is suited for GUI users who want visual model browsing and easy switching between models. Ollama is preferred for scripting, automation, and CI pipelines.

| Platform | Default Address | Best For | GPU Required |
|---|---|---|---|
| LM Studio | localhost:1234 | GUI users, visual model management | No (CPU works) |
| Ollama | localhost:11434 | Scripting, automation, production | No (CPU works) |
| vLLM | localhost:8000 | Multi-GPU, high-throughput servers | Recommended |

```python
# Python: Connect to LM Studio (localhost:1234)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",  # any string; LM Studio ignores it
)

response = client.chat.completions.create(
    model="llama-3.2-3b-instruct",  # exact model name shown in LM Studio
    messages=[
        {"role": "user", "content": "What is 2+2?"}
    ],
)

print(response.choices[0].message.content)
```
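Since the three servers differ only in their address, one way to keep application code provider-neutral is a small lookup table. This is a sketch under that assumption; the `PROVIDERS` keys, the placeholder API keys, and the `provider_settings` helper are illustrative names, not part of any library.

```python
# Default local endpoints from the table above. The dictionary keys and the
# placeholder API keys are illustrative choices, not required values.
PROVIDERS = {
    "ollama": {"base_url": "http://localhost:11434/v1", "api_key": "ollama"},
    "lmstudio": {"base_url": "http://localhost:1234/v1", "api_key": "lm-studio"},
    "vllm": {"base_url": "http://localhost:8000/v1", "api_key": "vllm"},
}


def provider_settings(provider):
    """Return the keyword arguments to pass to OpenAI(...) for a provider."""
    try:
        return dict(PROVIDERS[provider])
    except KeyError:
        known = ", ".join(sorted(PROVIDERS))
        raise ValueError(
            f"unknown provider {provider!r}; expected one of: {known}"
        ) from None


settings = provider_settings("lmstudio")
# With the openai package installed: client = OpenAI(**settings)
```

Model names are not shared across servers (Ollama uses tags such as `llama3.2:3b`, LM Studio shows its own identifiers), so the model name usually has to be configured alongside the provider.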

How to Use Ollama API From JavaScript in the Browser?

Calling Ollama from browser-side JavaScript is subject to the browser's cross-origin (CORS) rules. By default, Ollama accepts cross-origin requests only from pages served from the local machine (for example `http://localhost:3000`). Check Best Local LLM Frontends for browser-ready UIs that handle this seamlessly.

If the page is served from another origin, add that origin to the `OLLAMA_ORIGINS` environment variable before starting Ollama, or route the calls through a CORS proxy or server-side middleware.

```javascript
// Browser-side JavaScript (if server is localhost:3000, Ollama is localhost:11434)
fetch("http://localhost:11434/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama4:scout",      // higher-quality MoE model; verify it fits your memory
    // model: "llama3.2:3b",    // lightweight for 8 GB RAM
    messages: [{ role: "user", content: "What is 2+2?" }]
  })
})
  .then(res => res.json())
  .then(data => console.log(data.choices[0].message.content))
  .catch(err => console.error("Request failed:", err));
```
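For the server-side middleware option, a minimal sketch of a CORS-aware forwarding proxy is shown here in Python, using only the standard library. The allowed origin, the port, and all names are assumptions for illustration. It forwards non-streaming POST requests only and is not hardened for production use.

```python
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

OLLAMA_URL = "http://localhost:11434"        # upstream Ollama server
ALLOWED_ORIGINS = {"http://localhost:3000"}  # hypothetical front-end origin


def cors_headers(origin):
    """Return CORS headers for an allowed origin, or an empty dict otherwise."""
    if origin not in ALLOWED_ORIGINS:
        return {}
    return {
        "Access-Control-Allow-Origin": origin,
        "Access-Control-Allow-Methods": "POST, OPTIONS",
        "Access-Control-Allow-Headers": "Content-Type, Authorization",
    }


class ProxyHandler(BaseHTTPRequestHandler):
    def _send(self, status, body=b""):
        self.send_response(status)
        for name, value in cors_headers(self.headers.get("Origin")).items():
            self.send_header(name, value)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_OPTIONS(self):
        # Answer the browser's preflight request without contacting Ollama.
        self._send(204)

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        upstream = urllib.request.Request(
            OLLAMA_URL + self.path,
            data=self.rfile.read(length),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        try:
            with urllib.request.urlopen(upstream) as resp:
                self._send(resp.status, resp.read())
        except urllib.error.HTTPError as err:
            # Pass upstream error responses (4xx/5xx) through to the browser.
            self._send(err.code, err.read())


def serve(port=8080):
    """Run the proxy on 127.0.0.1 until interrupted."""
    HTTPServer(("127.0.0.1", port), ProxyHandler).serve_forever()
```

With `serve()` running, the browser code would call `http://localhost:8080/v1/chat/completions` instead of port 11434. Because the handler reads the whole upstream response before replying, streamed responses would arrive all at once.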

How Do You Stream Responses Token-by-Token?

Streaming lets you display responses as they are generated, token by token, instead of waiting for the entire response. As of May 2026, streaming works with all local models via the OpenAI-compatible API.

```python
# Python: streaming example
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

stream = client.chat.completions.create(
    model="llama4:scout",
    messages=[{"role": "user", "content": "Count to 10"}],
    stream=True,
)

for chunk in stream:
    # Some chunks carry no text (for example the final one), so check first
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()  # finish with a newline
```

With stream=True, Ollama can deliver the first token in roughly 0.1 s once the model is loaded (actual latency depends on hardware and prompt length) -- users see output immediately instead of waiting for the full response.
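Under the hood, a streamed response arrives as server-sent events: lines of the form `data: {...}` followed by a final `data: [DONE]`. The sketch below parses such lines by hand, which can help when debugging without the SDK. The function name and the sample lines are illustrative; real chunks carry additional fields such as `id` and `model`.

```python
import json


def parse_sse_line(line):
    """Return the text fragment carried by one SSE line, or None.

    Lines look like 'data: {...json chunk...}'; the stream ends with
    'data: [DONE]'. Blank lines and non-data lines carry no text.
    """
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    chunk = json.loads(payload)
    choices = chunk.get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


# Hand-written lines in the shape of a streamed chat completion.
raw_stream = [
    'data: {"choices": [{"delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "1, "}}]}',
    'data: {"choices": [{"delta": {"content": "2"}}]}',
    'data: {"choices": [{"delta": {}, "finish_reason": "stop"}]}',
    "data: [DONE]",
]

text = "".join(piece for piece in map(parse_sse_line, raw_stream) if piece)
```

The OpenAI SDK performs this parsing for you; the loop over `stream` in the example above receives each `data:` payload as a chunk object.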

Can Your Local Model Call Functions?

Yes, as of May 2026, function calling works with local models via the OpenAI-compatible API. You define a function schema, and the model can respond with the name of a function and the arguments to pass to it; your code then runs the function. This lets the models covered in Best Local LLMs for Coding integrate with your tool ecosystem.

Function calling support depends on the model. Llama 4 Scout, Qwen3 8B, Gemma 4 9B, and Mistral Small 3.1 all support tool calling reliably. Llama 3.1 8B and Qwen2.5 7B also support it (legacy). Smaller models (3B) may not reliably produce structured tool call JSON.

In 2026, the Model Context Protocol (MCP) extends function calling into a standardized tool connection layer. MCP lets any client (Claude Code, Cursor, custom apps) connect to any tool server via a single protocol, going beyond the per-request tool definitions shown below. Ollama works with MCP through the same OpenAI-compatible function calling API: an MCP client presents its tools to the model as `tools` definitions. For production tool integrations, MCP is becoming the standard; the function calling examples here remain the foundation.

When using OpenAI-compatible APIs locally, structured output and JSON mode work the same way as with cloud APIs. For enforcing schema compliance and format control across local and cloud models, see structured output and JSON mode.
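As a rough illustration of JSON mode, the sketch below builds a request that sets `response_format` to `{"type": "json_object"}` and then checks the reply on the client side. It assumes the local server honors `response_format`; the helper names, the system prompt, and the sample reply are illustrative.

```python
import json


def json_mode_request(model, instruction):
    """Build a chat request that asks the server for a JSON object reply."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Reply with a single JSON object."},
            {"role": "user", "content": instruction},
        ],
        "response_format": {"type": "json_object"},
    }


def parse_json_reply(content, required_keys):
    """Parse the model's reply and check that the expected keys are present."""
    data = json.loads(content)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return data


request_body = json_mode_request(
    "llama3.2:3b", 'Give the capital of France as {"city": ...}'
)
# A reply invented for illustration; a real one comes from the model.
parsed = parse_json_reply('{"city": "Paris"}', required_keys=["city"])
```

Validating on the client side is a deliberate choice here: smaller local models can return syntactically valid JSON that still omits a field, and a check like this surfaces the problem early.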

OpenAI-compatible APIs accept the same prompt formats as the cloud versions: system messages, user messages, and structured output. The full library of prompt engineering techniques applies directly to local API calls.

```python
# Example: local model calls a weather function
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            },
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="llama4:scout",
    messages=[{"role": "user", "content": "What is the weather in SF?"}],
    tools=tools,
)

# Check if the model returned a function call
if response.choices[0].message.tool_calls:
    call = response.choices[0].message.tool_calls[0]
    print(f"Call function: {call.function.name} with {call.function.arguments}")
```

Function calling flow with Ollama: the local model returns tool_call JSON, and your app executes the function -- supported by Llama 4 Scout, Qwen3 8B, Gemma 4 9B, and Mistral.
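The example above stops once the model has asked for a function. The next step, running the function and handing the result back, might look like the following sketch. The `get_weather` body, the registry, and the helper names are stand-ins; the `tool` role message with a `tool_call_id` follows the OpenAI chat format.

```python
import json


def get_weather(location):
    """Stand-in implementation; a real app would call a weather service."""
    return {"location": location, "forecast": "sunny", "temp_c": 18}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(name, arguments_json):
    """Execute the function the model asked for and return a JSON string."""
    if name not in TOOL_REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        kwargs = json.loads(arguments_json or "{}")
        return json.dumps(TOOL_REGISTRY[name](**kwargs))
    except (json.JSONDecodeError, TypeError) as exc:
        # Local models sometimes emit malformed or incomplete arguments.
        return json.dumps({"error": str(exc)})


def tool_result_message(tool_call_id, result_json):
    """Build the `tool` role message that carries the result to the model."""
    return {"role": "tool", "tool_call_id": tool_call_id, "content": result_json}


# Simulated values for what `call.id`, `call.function.name`, and
# `call.function.arguments` would hold after a tool-calling response.
result = run_tool_call("get_weather", '{"location": "SF"}')
followup = tool_result_message("call_0", result)
```

To finish the round trip, append the assistant message that contained the tool call and then `followup` to the `messages` list, and call `client.chat.completions.create` again so the model can phrase the final answer.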

Local LLM OpenAI APIs by Region

EU / GDPR & AI Act: For EU developers, running Ollama locally supports the GDPR Article 5 data minimization principle -- all inference stays on-device with no data egress to cloud APIs. Ollama is distributed via GitHub under the MIT license. EU AI Act high-risk system obligations apply from August 2, 2026 (pending Digital Omnibus). Because data never leaves your own hardware, local API inference meets data residency requirements by default; other GDPR obligations, such as a lawful basis for processing, still apply. For enterprises, this also reduces vendor lock-in.

Japan / APPI: Under Japan's Act on the Protection of Personal Information (APPI), on-premises model inference avoids the requirements that apply to transferring data to cloud providers. Ollama + Qwen3 8B runs on standard corporate laptops (8 GB RAM) with improved Japanese language support over Qwen2.5, and a throughput of 30-50 tok/sec meets real-time response expectations for Japanese language processing.

China / CAC: For deployment under China's Cybersecurity Law (Article 37, enforced by the CAC), local inference satisfies data localization mandates -- Ollama + Qwen3 runs on any Linux device without external API calls. Qwen3's native Chinese tokenization is 30-40% more efficient than Llama's, reducing local inference overhead.

What Are Common Mistakes With Local LLM OpenAI APIs?

  • Misunderstanding the API key. The OpenAI SDK requires a non-empty `api_key`, so pass any string such as `api_key="ollama"`; Ollama ignores it and does not authenticate. The only access control is network reachability: by default, requests must come from localhost.
  • Not realizing the model name matters. If you call `/chat/completions` with `model="gpt-4"` but have only pulled `llama3.2:3b` in Ollama, the request will fail. Use the exact model names from `ollama list`.
  • Assuming Ollama needs internet. It does not; the API is entirely local. But without an explicit `base_url`, the OpenAI library sends requests to OpenAI's servers, which fails offline or with a dummy key. Always set `base_url` explicitly.
  • CORS errors from the browser. If you call Ollama from a browser-side script and get a CORS error, the browser blocked the request because the page's origin is not allowed. See Local LLMs with VS Code and Cursor for editor-based solutions that bypass CORS.
  • Not setting `stream=True` when expecting streaming. If you want token-by-token responses, you must explicitly set `stream=True` in the request. By default, the API waits for the full response.
  • Using `llama3.2:3b` in production when better models are available. Many tutorials still use Llama 3.2 3B because it fits on 8 GB RAM. If your hardware can run a larger model such as `llama4:scout`, switch to it for dramatically better quality with the same API code. Use 3B models for testing API integration, not production workloads.
  • Not setting `OLLAMA_NUM_PARALLEL` for concurrent requests. Depending on its version and available memory, Ollama may process only one request per model at a time. For multi-user apps or parallel test suites, set `OLLAMA_NUM_PARALLEL=4` (or higher) to handle concurrent API calls. Without this, requests queue and latency spikes.

Warning: Ollama's API has NO authentication by default. If you expose it to your network (`OLLAMA_HOST=0.0.0.0`), anyone on that network can send requests, load models, and consume GPU resources. For multi-user or production setups, place a reverse proxy (nginx, Caddy) with authentication in front of Ollama, and never expose port 11434 directly to the internet.

Ollama (port 11434), vLLM (port 8000), and LM Studio (port 1234) all expose OpenAI-compatible endpoints -- identical client code, different ports and use cases.
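On the client side, concurrent requests can be issued from a thread pool, as in the sketch below. The `run_concurrently` helper and the `fake_send` stand-in are illustrative; `fake_send` only echoes its input so the sketch runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(prompts, send, max_workers=4):
    """Send every prompt through `send` in parallel; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))


def fake_send(prompt):
    """Stand-in for a real API call so the sketch runs without a server."""
    return f"echo: {prompt}"


answers = run_concurrently(["a", "b", "c"], fake_send)
```

In a real application, `send` would wrap `client.chat.completions.create(...)` and return the reply text. Keeping `max_workers` at or below the server's `OLLAMA_NUM_PARALLEL` value is a reasonable starting point, since extra requests would only wait in the server's queue.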

Common Questions About Local LLM APIs

Do I need to modify my OpenAI code to use Ollama?

Only two lines. Set `base_url="http://localhost:11434/v1"` and `api_key="ollama"`, and pass the name of a model you have pulled locally. Everything else stays the same: if you have code using the OpenAI library, swap these values and it works with your local model.

Can I use the API from a different computer on my network?

Yes. By default, Ollama listens on localhost only. To allow network access, set the environment variable `OLLAMA_HOST=0.0.0.0:11434` before running Ollama. Then point your code to `http://<machine-ip>:11434/v1`. Be careful with security -- use a firewall if this is production.

Does LM Studio have an OpenAI-compatible API?

Yes. LM Studio exposes an OpenAI-compatible API at `http://localhost:1234/v1`. Enable it under the Local Server tab, load a model, then click Start Server. Use the same Python or Node.js code as Ollama -- only the port changes (1234 instead of 11434).

Can I call multiple models simultaneously?

Yes, if they are loaded in Ollama. Note that each loaded model occupies its own memory, so two models of similar size roughly double VRAM usage. You must have enough GPU memory for both.

Is the API authenticated?

No. By default, Ollama's API has no authentication. Anyone with access to localhost:11434 can use it. For production with network access, add authentication via a reverse proxy (nginx with Basic Auth, etc.).
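If the reverse proxy uses HTTP Basic Auth, clients have to send an `Authorization` header with each request. The sketch below builds such a request with the standard library; the proxy URL, the credentials, and the function name are hypothetical placeholders.

```python
import base64
import json
import urllib.request


def authed_chat_request(base_url, username, password, body):
    """Build a POST to /chat/completions carrying HTTP Basic credentials."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Hypothetical proxy address and credentials; substitute your own.
request = authed_chat_request(
    "https://llm.example.internal/v1",
    "user",
    "pass",
    {"model": "llama3.2:3b", "messages": [{"role": "user", "content": "ping"}]},
)
# urllib.request.urlopen(request) would send it once the proxy is reachable.
```

The OpenAI SDKs send the configured `api_key` as an `Authorization: Bearer` header, so a proxy that checks bearer tokens instead of Basic Auth can work with unchanged SDK code.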

How do I use streaming with the Ollama OpenAI API?

Set `stream=True` in your OpenAI library call. Ollama returns server-sent events (SSE), each carrying a small piece of the response. In Python, iterate over the result of `client.chat.completions.create(stream=True, ...)` and print `chunk.choices[0].delta.content` whenever it is not empty.

Does Ollama support function calling / tool use via the API?

Yes, for models that support it (Llama 4 Scout, Qwen3 8B, Gemma 4 9B, Mistral Small 3.1). Legacy models (Llama 3.1 8B, Qwen2.5 7B) are also supported. Pass a `tools=[...]` list in the API call as you would with OpenAI. Ollama parses the model's tool calls and returns them as structured JSON. Not all models support this -- check the model documentation.

What is MCP and how does it relate to the OpenAI-compatible API?

MCP (Model Context Protocol) is a standardized protocol for connecting AI models to external tools and data sources. It builds on top of function calling (the same `tools=[...]` parameter shown in the examples above) but adds a standard server-client architecture so tools are discoverable and reusable across applications. Ollama works with MCP through its OpenAI-compatible function calling endpoint: an MCP client presents its tools to the model as `tools` definitions. For simple integrations, the function calling examples in this article are sufficient. For complex multi-tool workflows, MCP provides a more structured approach.

What is the difference between Ollama /api/generate and /v1/chat/completions?

/api/generate is Ollama's native single-turn endpoint. /v1/chat/completions is the OpenAI-compatible multi-turn endpoint. Use /v1/chat/completions for all new projects -- it supports conversation history and is compatible with OpenAI libraries.
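Conversation history with `/v1/chat/completions` means resending all earlier turns on every call, because the endpoint itself is stateless. A minimal sketch of that bookkeeping follows; the `Conversation` class and the `fake_send` stand-in are illustrative names, and `fake_send` replaces the real API call so the sketch runs offline.

```python
class Conversation:
    """Keeps the message history that /v1/chat/completions expects per call."""

    def __init__(self, system_prompt=None):
        self.messages = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def ask(self, user_text, send):
        """Append the user turn, call `send(messages)`, and record the reply."""
        self.messages.append({"role": "user", "content": user_text})
        reply = send(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def fake_send(messages):
    """Stand-in for the API call: reports how many user turns it received."""
    user_turns = sum(1 for m in messages if m["role"] == "user")
    return f"reply {user_turns}"


chat = Conversation(system_prompt="Be brief.")
first = chat.ask("What is 2+2?", fake_send)
second = chat.ask("And doubled?", fake_send)
```

With a real client, `send` could be a function that calls `client.chat.completions.create(model=..., messages=messages)` and returns `choices[0].message.content`.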

Can I use vLLM as an OpenAI-compatible API?

Yes. vLLM runs an OpenAI-compatible server at `http://localhost:8000/v1` by default. Start it with `python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-v0.1`. Use the same client code as with Ollama, with the model name set to the one vLLM is serving.

How do I use the Ollama API with the Node.js openai package?

Import OpenAI from openai. Set baseURL: "http://localhost:11434/v1" and apiKey: "ollama" in the constructor. Then call client.chat.completions.create() exactly as you would with the real OpenAI API -- no other changes needed.

How do I switch between Ollama and OpenAI in the same codebase?

Use an environment variable that your own code reads, for example `USE_LOCAL`: when it is `true`, construct the client for Ollama (base_url `http://localhost:11434/v1`, api_key `"ollama"`); when it is `false`, construct it for OpenAI. The OpenAI Python library accepts `base_url` as a constructor argument. Set `USE_LOCAL=false` in production to switch to OpenAI; apart from the model name, no other code changes.
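One possible shape for this toggle is sketched below. `USE_LOCAL` is the variable named above; `LOCAL_MODEL`, `CLOUD_MODEL`, the default model names, and the `client_settings` helper are assumptions made for the example.

```python
import os


def client_settings(env=None):
    """Pick endpoint, key, and model from environment variables.

    USE_LOCAL selects the backend; LOCAL_MODEL, CLOUD_MODEL, and the default
    model names are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    if env.get("USE_LOCAL", "true").lower() == "true":
        return {
            "base_url": "http://localhost:11434/v1",
            "api_key": "ollama",
            "model": env.get("LOCAL_MODEL", "llama3.2:3b"),
        }
    return {
        "base_url": "https://api.openai.com/v1",
        "api_key": env.get("OPENAI_API_KEY", ""),
        "model": env.get("CLOUD_MODEL", "gpt-4o"),
    }


# Explicit dictionaries stand in for the process environment here.
local = client_settings({"USE_LOCAL": "true"})
cloud = client_settings({"USE_LOCAL": "false", "OPENAI_API_KEY": "sk-example"})
```

The `base_url` and `api_key` entries go to the `OpenAI(...)` constructor and `model` goes to each `create(...)` call, which keeps the switch in one place.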

Can I use the OpenAI-compatible API with LangChain?

Yes. Use ChatOpenAI with base_url="http://localhost:11434/v1" and api_key="ollama". This makes Ollama a drop-in replacement for OpenAI in any LangChain pipeline -- RAG chains, agents, and tools all work without modification. LangChain also has a dedicated ChatOllama class for Ollama-specific features.

A Note on Third-Party Facts

This article references third-party AI models, benchmarks, prices, and licenses. The AI landscape changes rapidly. Benchmark scores, license terms, model names, and API prices can shift between the time of writing and the time you read this. Before making deployment or compliance decisions based on this article, verify current figures on each provider's official source: Hugging Face model cards for licenses and benchmarks, provider websites for API pricing, and EUR-Lex for current GDPR and EU AI Act text. This article reflects publicly available information as of May 2026.
