2026-06-03 · flo2 blog

Ollama's OpenAI-Compatible API: Call /v1/chat/completions

If you run models locally with Ollama, you don't have to learn a new client library to talk to them. Ollama ships an OpenAI-compatible API, which means the same OpenAI SDK and the same /v1/chat/completions request shape you already use for cloud models also work against a model running on your laptop. Point your client at http://localhost:11434/v1, pass any non-empty API key, and send a normal chat completion. This guide shows the exact curl and Python calls, what's supported, the gotchas worth knowing, and how to keep local and cloud models behind one endpoint.

Where Ollama's OpenAI endpoint lives

By default Ollama listens on port 11434. The OpenAI-compatible routes are mounted under the /v1 prefix, so the base URL you give to any OpenAI client is:

http://localhost:11434/v1

From there the familiar paths exist, including /v1/chat/completions, /v1/completions, /v1/models, and /v1/embeddings. Ollama also has its own native API under /api (for example /api/chat and /api/generate), but the whole point of the OpenAI-compatible layer is that you can reuse existing client code and change only its configuration.
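As a rough sketch of how those routes hang off the base URL, the helper below joins a path onto it and reads the model list from /v1/models using only the standard library. The names BASE_URL, endpoint, and list_model_ids are illustrative and not part of Ollama or the OpenAI SDK; list_model_ids assumes a server is running and that the response follows the OpenAI list shape (a data array of objects with an id).

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # default Ollama port plus the /v1 prefix


def endpoint(path: str, base_url: str = BASE_URL) -> str:
    """Join an OpenAI-style route such as 'chat/completions' onto the base URL."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def list_model_ids(base_url: str = BASE_URL, timeout: float = 5.0) -> list[str]:
    """Return model ids from GET /v1/models (needs a running Ollama server)."""
    with urllib.request.urlopen(endpoint("models", base_url), timeout=timeout) as resp:
        body = json.load(resp)
    return [m["id"] for m in body.get("data", [])]
```

Calling endpoint("chat/completions") gives the full chat URL, and list_model_ids() is a quick way to confirm the server is reachable before you wire up a real client.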

One thing that trips people up first: Ollama runs no authentication locally. The OpenAI SDKs still require some API key to be set, so you pass a placeholder. Any non-empty string works — the convention is literally "ollama".

A curl call to /v1/chat/completions

Before you call a model, the server has to be running and the model has to be pulled. Start the server with ollama serve if it isn't already running (the desktop app or a system service usually starts it for you), then run ollama pull llama3.1 (or whatever tag you want) once. Here is a direct request to the Ollama OpenAI endpoint:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ollama" \
  -d '{
    "model": "llama3.1",
    "messages": [
      { "role": "system", "content": "You are a terse assistant." },
      { "role": "user", "content": "Explain what an embedding is in one sentence." }
    ]
  }'

The response comes back in the standard OpenAI shape — a choices array with message.content, plus a usage object. The Authorization header is accepted but not checked, so the token value is irrelevant here.
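For readers who would rather see that request and response in Python without the SDK, here is a minimal standard-library sketch. build_chat_request and read_reply are hypothetical helper names, and the sample response body is made up for illustration, so real text and token counts will differ.

```python
import json
import urllib.request


def build_chat_request(model: str, messages: list[dict],
                       base_url: str = "http://localhost:11434/v1") -> urllib.request.Request:
    """Build (without sending) the same POST that the curl command makes."""
    payload = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/chat/completions",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer ollama",  # placeholder; Ollama does not check it
        },
        method="POST",
    )


def read_reply(response_body: dict) -> tuple[str, int]:
    """Pull the assistant text and total token count out of an OpenAI-shaped response."""
    text = response_body["choices"][0]["message"]["content"]
    total = response_body.get("usage", {}).get("total_tokens", 0)
    return text, total


# Made-up response body for illustration; real text and counts will differ.
sample = {
    "choices": [{"index": 0,
                 "message": {"role": "assistant", "content": "A vector."},
                 "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 20, "completion_tokens": 3, "total_tokens": 23},
}
text, total = read_reply(sample)
print(text, total)
```

To send the request for real, pass the result of build_chat_request to urllib.request.urlopen and feed the decoded JSON to read_reply.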

Run a local LLM with the OpenAI Python SDK

This is where the compatibility pays off. To run a local LLM with the OpenAI SDK, you change two client settings versus a cloud call, the base_url and the api_key, and you pass the Ollama tag you pulled as the model name.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK, ignored by Ollama
)

resp = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "user", "content": "Write a haiku about local inference."},
    ],
)

print(resp.choices[0].message.content)

The JavaScript/TypeScript SDK follows the same pattern — set baseURL to http://localhost:11434/v1 and apiKey to "ollama". Any tool that lets you override the OpenAI base URL (LangChain, LlamaIndex, the Vercel AI SDK, an internal wrapper) can target Ollama the same way.

Streaming responses

Streaming works exactly like it does against OpenAI: set stream=True and iterate the chunks. Each delta arrives as it's generated, which matters for local models where you want tokens on screen before the full answer is done.

stream = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Count slowly from 1 to 10."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

The stream uses the same server-sent events framing over the wire, with each chunk on a data: line, so a streaming client written for OpenAI parses Ollama's stream without changes.
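If you ever need to read that stream without the SDK, the wire format is simple enough to parse by hand. The sketch below assumes OpenAI-style lines of the form data: {json}, closed by a data: [DONE] sentinel; iter_stream_text is a hypothetical helper name, and the sample lines are hand-written rather than captured from a real server.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield content deltas from OpenAI-style 'data: {...}' stream lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data event
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # sentinel that closes an OpenAI-style stream
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta", {}).get("content")
        if delta:
            yield delta


# Hand-written sample lines for illustration, not captured from a real server.
sample_lines = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":"1"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":", 2"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_stream_text(sample_lines)))
```

In practice the lines would come from decoding an HTTP response body line by line; the parsing logic stays the same.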

What's supported, and what's partial

The OpenAI-compatible surface in Ollama covers the most common needs, but it is not a byte-for-byte clone of OpenAI's full API. As a practical map:

Chat completions — well supported, including system/user/assistant roles and multi-turn conversations.
Streaming — supported via stream=True.
Common parameters — fields like temperature, top_p, max_tokens, stop, and seed are mapped to Ollama's options.
Tool / function calling — available for models that support tools, but behavior varies by model and is less mature than the cloud equivalents.
Embeddings — exposed at /v1/embeddings for embedding models you've pulled.
Vision — supported for multimodal models that accept image input.

Ollama moves quickly, and the exact parameter and feature coverage changes between releases. Treat the list above as a starting point and verify against the current Ollama documentation for the version you're running before you depend on any single field, especially for tools, structured outputs, and embeddings.
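To make the common parameters concrete, here is a small sketch of a request body builder that includes only the fields you actually set. chat_payload is a hypothetical helper, and the parameter values in the usage line are arbitrary examples; whether a given field is honored still depends on your Ollama version and model.

```python
import json


def chat_payload(model, messages, *, temperature=None, top_p=None,
                 max_tokens=None, stop=None, seed=None, stream=False):
    """Build a /v1/chat/completions body, leaving out parameters that were not set."""
    body = {"model": model, "messages": messages}
    optional = {
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
        "stop": stop,
        "seed": seed,
    }
    # Only send what the caller chose, so server-side defaults apply to the rest.
    body.update({k: v for k, v in optional.items() if v is not None})
    if stream:
        body["stream"] = True
    return body


payload = chat_payload(
    "llama3.1",
    [{"role": "user", "content": "Name three prime numbers."}],
    temperature=0.2,
    max_tokens=64,
    seed=42,
)
print(json.dumps(payload, indent=2))
```

The same dictionary works as keyword arguments to the SDK, for example client.chat.completions.create(**payload).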

Common gotchas

Most problems calling the Ollama OpenAI endpoint come down to a few recurring issues:

The model isn't pulled. If you reference a tag you never downloaded, the request fails. Run ollama list to see what's local and ollama pull <model> to fetch it. The model string in your request must match a tag exactly, including any size or quantization suffix.
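Context length. Local models have a default context window that may be smaller than you expect, and long prompts can be silently truncated. Check the model's context size and raise it if you push large inputs, for example with PARAMETER num_ctx in a Modelfile. The OpenAI request shape has no standard field for context size, so check the Ollama documentation for your version before relying on a per-request setting.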
No auth means no protection. The endpoint accepts any token, so if you bind Ollama to a non-loopback address you've effectively published an open API. Keep it on localhost unless you put a real gateway or reverse proxy in front of it.
Base URL mistakes. The OpenAI client expects the /v1 suffix on the base URL; omitting it (or doubling it up to /v1/v1) is a frequent cause of 404s. The base URL is http://localhost:11434/v1, not http://localhost:11434.
Cold starts. The first request to a model that isn't already in memory can be noticeably slower, because the server has to read the weights from disk before it can generate anything. Subsequent calls are faster while the model stays loaded.
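Two of these gotchas, the base URL suffix and the loopback binding, are easy to guard against in code. The helpers below are a hedged sketch with made-up names (normalize_base_url, is_loopback): the first forces exactly one trailing /v1, and the second reports whether a URL points at localhost or a loopback IP address.

```python
import ipaddress
from urllib.parse import urlsplit, urlunsplit


def normalize_base_url(url: str) -> str:
    """Return the URL with exactly one trailing /v1 segment."""
    parts = urlsplit(url.strip())
    segments = [s for s in parts.path.split("/") if s]
    while segments and segments[-1] == "v1":
        segments.pop()  # drop any existing /v1, including a doubled /v1/v1
    path = "/" + "/".join(segments + ["v1"])
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def is_loopback(url: str) -> bool:
    """True when the URL's host is localhost or a loopback IP address."""
    host = urlsplit(url).hostname or ""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a non-IP hostname other than localhost


print(normalize_base_url("http://localhost:11434"))
print(is_loopback("http://192.168.1.10:11434/v1"))
```

Running user-supplied URLs through normalize_base_url before constructing a client removes the most common source of 404s, and a failed is_loopback check is a reasonable prompt to ask whether a gateway or reverse proxy sits in front of the server.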

One endpoint for local and cloud models

Local inference is great for development, privacy-sensitive work, and high-volume cheap calls. But a 7B–8B model running on your machine isn't always enough — sometimes you need a frontier cloud model for the hard requests. The pattern most teams land on is: keep the cheap, frequent calls on local Ollama, and fall back to a cloud provider only when the local model can't deliver.

Doing that by hand means juggling two base URLs, two key setups, and conditional logic scattered through your code. A gateway collapses it into one endpoint. You send every request to a single OpenAI-compatible URL, and the gateway decides where it goes — local model for the routine stuff, a hosted model when quality or context demands it.
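To show what the by-hand version of that conditional logic tends to look like, here is a minimal local-first fallback sketch. It is not how any particular gateway implements routing; complete_with_fallback and the two stand-in backends are hypothetical, and in real code each backend would wrap a client pointed at a different base URL.

```python
from typing import Callable, Sequence

# (label, function that takes messages and returns reply text)
Backend = tuple[str, Callable[[list[dict]], str]]


def complete_with_fallback(backends: Sequence[Backend],
                           messages: list[dict]) -> tuple[str, str]:
    """Try each backend in order; return (label, reply) from the first that succeeds."""
    errors = []
    for label, call in backends:
        try:
            reply = call(messages)
        except Exception as exc:  # broad on purpose: any failure moves to the next backend
            errors.append(f"{label}: {exc}")
            continue
        if reply and reply.strip():
            return label, reply
        errors.append(f"{label}: empty reply")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


# Stand-in backends so the sketch runs without a server.
def local_model(messages):
    raise ConnectionError("local server not reachable")


def cloud_model(messages):
    return "answer from the hosted model"


label, reply = complete_with_fallback(
    [("local", local_model), ("cloud", cloud_model)],
    [{"role": "user", "content": "Summarize this."}],
)
print(label, reply)
```

Even this small version has to decide what counts as a failure and how to report it, which is the kind of logic that ends up duplicated across a codebase when every call site handles it separately.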

That's exactly the job flo2 does. flo2 is a developer-first LLM gateway with zero token markup: you bring your own provider keys (OpenAI, Anthropic, Google Gemini, Groq, Cerebras, DeepInfra, Mistral, xAI, OpenRouter) and pay providers directly. A single key — usable through both an OpenAI-compatible and an Anthropic-compatible API — routes each request to the cheapest or fastest model, with fallback chains so a request can start cheap and escalate when needed. It also gives you smart routing, AI racing, A/B testing with an LLM judge to gauge model–task fit, opt-in response caching, and true per-call cost accounting. As the zero-markup OpenRouter alternative, it's free during Beta.

The mental model is the same one Ollama gives you locally — one familiar API, many models behind it — just extended across your local box and every cloud provider at once. If you want the bigger picture first, read what is an LLM gateway, then route your local Ollama and cloud models through a single key with flo2.
