Ollama's OpenAI-Compatible API: Call /v1/chat/completions
If you run models locally with Ollama, you don't have to learn a new client library to talk to them. Ollama ships an OpenAI-compatible API, which means the same OpenAI SDK and the same /v1/chat/completions request shape you already use for cloud models also work against a model running on your laptop. Point your client at http://localhost:11434/v1, pass any non-empty API key, and send a normal chat completion. This guide shows the exact curl and Python calls, what's supported, the gotchas worth knowing, and how to keep local and cloud models behind one endpoint.
Where Ollama's OpenAI endpoint lives
By default Ollama listens on port 11434. The OpenAI-compatible routes are mounted under the /v1 prefix, so the base URL you give to any OpenAI client is:
http://localhost:11434/v1
From there the familiar paths exist, including /v1/chat/completions, /v1/completions, /v1/models, and /v1/embeddings. Ollama also has its own native API under /api (for example /api/chat and /api/generate), but the whole point of the OpenAI-compatible layer is that you can reuse existing code and change nothing except the base URL, the API key, and the model name.
One thing that trips people up first: Ollama runs no authentication locally. The OpenAI SDKs still require some API key to be set, so you pass a placeholder. Any non-empty string works — the convention is literally "ollama".
A curl call to /v1/chat/completions
Before you call a model, it has to be pulled. Run ollama pull llama3.1 (or whatever tag you want) once, then start the server with ollama serve if it isn't already running. Here is a direct request to the Ollama OpenAI endpoint:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ollama" \
  -d '{
    "model": "llama3.1",
    "messages": [
      { "role": "system", "content": "You are a terse assistant." },
      { "role": "user", "content": "Explain what an embedding is in one sentence." }
    ]
  }'
The response comes back in the standard OpenAI shape — a choices array with message.content, plus a usage object. The Authorization header is accepted but not checked, so the token value is irrelevant here.
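For readers who want the same call without curl or the SDK, here is a minimal sketch using only Python's standard library. It builds the request and pulls the reply text and the usage object out of the parsed response. The helper names (build_chat_request, extract_reply, chat) are made up for this example, and the code assumes the default local address and the response shape described above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # default local address used in this guide


def build_chat_request(model, messages, api_key="ollama"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Placeholder token: Ollama accepts the header but does not check it.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def extract_reply(response):
    """Return (text, usage) from a parsed chat completion response."""
    text = response["choices"][0]["message"]["content"]
    usage = response.get("usage", {})
    return text, usage


def chat(model, messages):
    """Send the request; needs a running Ollama server with the model pulled."""
    with urllib.request.urlopen(build_chat_request(model, messages)) as resp:
        return extract_reply(json.load(resp))
```

With the server running, chat("llama3.1", [{"role": "user", "content": "Hello"}]) should return the reply text together with the usage dictionary.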
Run a local LLM with the OpenAI Python SDK
This is where the compatibility pays off. To run a local LLM with the OpenAI SDK, you only change two things versus a cloud call: the base_url and the api_key. The model name is the Ollama tag you pulled.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK, ignored by Ollama
)

resp = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "user", "content": "Write a haiku about local inference."},
    ],
)
print(resp.choices[0].message.content)
The JavaScript/TypeScript SDK follows the same pattern — set baseURL to http://localhost:11434/v1 and apiKey to "ollama". Any tool that lets you override the OpenAI base URL (LangChain, LlamaIndex, the Vercel AI SDK, an internal wrapper) can target Ollama the same way.
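Since only the base URL and API key differ between a local and a cloud call, one way to keep code identical across both is to read those two values from the environment and default to Ollama. The sketch below does that. The variable names LLM_BASE_URL and LLM_API_KEY are hypothetical, chosen for this example and not something Ollama or the SDK defines.

```python
import os

LOCAL_BASE_URL = "http://localhost:11434/v1"


def client_settings(env=None):
    """Return the keyword arguments to pass to OpenAI(...).

    LLM_BASE_URL and LLM_API_KEY are hypothetical names used only in this sketch.
    """
    env = os.environ if env is None else env
    return {
        "base_url": env.get("LLM_BASE_URL", LOCAL_BASE_URL),
        # The SDK insists on a non-empty key; Ollama ignores its value.
        "api_key": env.get("LLM_API_KEY") or "ollama",
    }


def make_client(env=None):
    """Create an OpenAI client; the SDK is imported lazily so the helper
    above stays usable without it."""
    from openai import OpenAI

    return OpenAI(**client_settings(env))
```

With no variables set, make_client() targets local Ollama. Exporting the two variables points the same code at a hosted provider without touching the call sites.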
Streaming responses
Streaming works exactly like it does against OpenAI: set stream=True and iterate the chunks. Each delta arrives as it's generated, which matters for local models where you want tokens on screen before the full answer is done.
# Reuses the client created in the previous example.
stream = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Count slowly from 1 to 10."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks carry no text, such as the final one
        print(delta, end="", flush=True)
On the wire, Ollama uses the same server-sent events format as OpenAI, with each chunk sent as a line prefixed with data:, so a streaming client written for OpenAI parses Ollama's stream without changes.
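The SDK hides the wire format, but it is simple enough to parse by hand. The sketch below assumes the usual OpenAI streaming convention: one JSON chunk per data: line and a final data: [DONE] sentinel. The function name iter_sse_deltas is invented for this example, and it takes already-decoded text lines rather than reading from a socket.

```python
import json


def iter_sse_deltas(lines):
    """Yield content fragments from an iterable of decoded SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue  # a chunk without choices carries no text
        delta = choices[0].get("delta", {}).get("content")
        if delta:
            yield delta
```

Joining the yielded fragments reconstructs the full reply, which is what the SDK's chunk iterator does for you behind the scenes.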
What's supported, and what's partial
The OpenAI-compatible surface in Ollama covers the most common needs, but it is not a byte-for-byte clone of OpenAI's full API. As a practical map:
- Chat completions: well supported, including system/user/assistant roles and multi-turn conversations.
- Streaming: supported via stream=True.
- Common parameters: fields like temperature, top_p, max_tokens, stop, and seed are mapped to Ollama's options.
- Tool / function calling: available for models that support tools, but behavior varies by model and is less mature than the cloud equivalents.
- Embeddings: exposed at /v1/embeddings for embedding models you've pulled.
- Vision: supported for multimodal models that accept image input.
Because Ollama moves quickly and the exact parameter and feature coverage changes between releases, treat the list above as a starting point and verify against the current Ollama documentation for the version you're running before you depend on any single field — especially for tools, structured outputs, and embeddings.
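To make the parameter mapping concrete, here is a small sketch that assembles a request with the sampling fields listed above and leaves out anything you didn't set, so the server's defaults apply. The function name build_payload is made up for this example. The field names are the standard OpenAI ones, and whether a given Ollama release honors each of them is something to confirm for your version.

```python
def build_payload(model, messages, *, temperature=None, top_p=None,
                  max_tokens=None, stop=None, seed=None, stream=False):
    """Build keyword arguments for a chat completion, omitting unset fields."""
    payload = {"model": model, "messages": messages}
    optional = {
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
        "stop": stop,
        "seed": seed,
    }
    # Keep explicit zeros (temperature=0 is meaningful); drop only None.
    payload.update({key: value for key, value in optional.items() if value is not None})
    if stream:
        payload["stream"] = True
    return payload
```

The result can be passed straight to the SDK, for example client.chat.completions.create(**build_payload("llama3.1", messages, temperature=0, seed=42)).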
Common gotchas
Most problems calling the Ollama OpenAI endpoint come down to a few recurring issues:
- The model isn't pulled. If you reference a tag you never downloaded, the request fails. Run ollama list to see what's local and ollama pull <model> to fetch it. The model string in your request must match a tag exactly, including any size or quantization suffix.
- Context length. Local models have a default context window that may be smaller than you expect, and long prompts can be silently truncated. Check the model's context size and raise it if you push large inputs, for example with a num_ctx parameter in a Modelfile. The OpenAI request format has no standard field for context size, so per-request options apply only on Ollama's native API.
- No auth means no protection. The endpoint accepts any token, so if you bind Ollama to a non-loopback address you've effectively published an open API. Keep it on localhost unless you put a real gateway or reverse proxy in front of it.
- Base URL mistakes. The OpenAI client expects the /v1 suffix on the base URL; omitting it (or doubling it up to /v1/v1) is a frequent cause of 404s. The base URL is http://localhost:11434/v1, not http://localhost:11434.
- Cold starts. The first request that has to load a model into memory can be noticeably slower while weights are read from disk. Subsequent calls are faster.
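Two of these gotchas, the base URL and the model tag, can be caught before the first request goes out. The sketch below shows one way to do it. The helper names are invented for this example, and the tag comparison assumes Ollama's convention that a name without an explicit tag refers to the latest tag.

```python
def normalize_base_url(url):
    """Return url with exactly one trailing /v1 segment."""
    url = url.rstrip("/")
    while url.endswith("/v1"):
        url = url[: -len("/v1")].rstrip("/")  # strip missing-or-doubled suffixes
    return url + "/v1"


def with_default_tag(name):
    """Append :latest when the final path segment has no explicit tag."""
    last_segment = name.rsplit("/", 1)[-1]
    return name if ":" in last_segment else f"{name}:latest"


def is_pulled(requested, available):
    """True if requested matches one of the locally available model names."""
    wanted = with_default_tag(requested)
    return any(with_default_tag(name) == wanted for name in available)
```

The list of available names can come from the models route, for instance [m.id for m in client.models.list().data] with the OpenAI SDK, or from the output of ollama list.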
One endpoint for local and cloud models
Local inference is great for development, privacy-sensitive work, and high-volume cheap calls. But a 7B–8B model running on your machine isn't always enough — sometimes you need a frontier cloud model for the hard requests. The pattern most teams land on is: keep the cheap, frequent calls on local Ollama, and fall back to a cloud provider only when the local model can't deliver.
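Done by hand, that pattern looks roughly like the sketch below: try each backend in order and move on when one fails or returns something unusable. Everything here is illustrative. The function name, the (name, callable) backend shape, and the default "non-empty reply" quality check are assumptions for this example, not part of Ollama or any SDK.

```python
def complete_with_fallback(messages, backends, is_good_enough=None):
    """Try each (name, call) backend in order and return (name, text).

    Each call takes the message list and returns the reply text. In practice
    it would wrap a client pointed at local Ollama or at a cloud provider.
    """
    if is_good_enough is None:
        # Default check: accept any non-blank reply.
        is_good_enough = lambda text: bool(text and text.strip())

    errors = []
    for name, call in backends:
        try:
            text = call(messages)
        except Exception as exc:  # connection refused, missing model, timeout, ...
            errors.append((name, repr(exc)))
            continue
        if is_good_enough(text):
            return name, text
        errors.append((name, "reply rejected by is_good_enough"))
    raise RuntimeError(f"all backends failed: {errors}")
```

Listing the local backend first keeps routine calls on your machine, and the cloud backend is reached only when the local one errors out or its reply fails the check.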
Doing that by hand means juggling two base URLs, two key setups, and conditional logic scattered through your code. A gateway collapses it into one endpoint. You send every request to a single OpenAI-compatible URL, and the gateway decides where it goes — local model for the routine stuff, a hosted model when quality or context demands it.
That's exactly the job flo2 does. flo2 is a developer-first LLM gateway with zero token markup: you bring your own provider keys (OpenAI, Anthropic, Google Gemini, Groq, Cerebras, DeepInfra, Mistral, xAI, OpenRouter) and pay providers directly. A single key — usable through both an OpenAI-compatible and an Anthropic-compatible API — routes each request to the cheapest or fastest model, with fallback chains so a request can start cheap and escalate when needed. It also gives you smart routing, AI racing, A/B testing with an LLM judge to gauge model–task fit, opt-in response caching, and true per-call cost accounting. As the zero-markup OpenRouter alternative, it's free during Beta.
The mental model is the same one Ollama gives you locally (one familiar API, many models behind it), just extended across your local box and every cloud provider at once. If you want the bigger picture first, read "What is an LLM gateway", then route your local Ollama and cloud models through a single key with flo2.