2026-06-03 · flo2 blog

Use Cerebras with the OpenAI SDK: Compatible API & Base URL

Cerebras Inference is one of the fastest LLM backends available today, and it exposes an OpenAI-compatible endpoint. That means you can point the standard openai Python or JavaScript client straight at Cerebras by changing three values: the base URL, your API key, and the model name. No new SDK, no custom client, no rewrite. This guide covers where the endpoint lives (and where to confirm the exact URL in Cerebras's official docs), walks through curl, Python, and JavaScript examples, explains what the compatibility layer supports, and shows how to pair Cerebras's fast inference with a fallback gateway so you get speed without single-provider fragility.

Cerebras's OpenAI-compatible endpoint

Cerebras Inference provides a Chat Completions API that follows the OpenAI wire format. The base URL for the compatible endpoint is documented in the Cerebras Inference docs. Check the current value there rather than hard-coding anything you read in a blog post, since provider URLs occasionally change. At time of writing the endpoint follows the pattern https://<cerebras-host>/v1 (the examples in this post use https://api.cerebras.ai/v1), but confirm the exact host in the official documentation before you ship.

Authentication uses a standard bearer token: your Cerebras API key, created in the Cerebras Cloud console. The Authorization: Bearer <key> header is identical to what the OpenAI SDK sends by default, which is exactly why the compatibility works.

Model IDs, by contrast, are Cerebras-specific: there is no gpt-4o here. Cerebras hosts its own model catalog; verify the current list and exact model identifiers in the Cerebras Inference docs before you commit a model string to your codebase.
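One way to avoid committing any of the three values is to read them from the environment and fail early when one is missing. The following is a minimal sketch, not an official pattern: the helper name load_cerebras_settings and the CEREBRAS_MODEL variable are assumptions made for illustration, and the values you supply should still come from the Cerebras docs.

```python
import os
from typing import Mapping

# Variable names are assumptions of this sketch; CEREBRAS_MODEL in particular
# is hypothetical and not something the Cerebras docs define.
REQUIRED_VARS = ("CEREBRAS_BASE_URL", "CEREBRAS_API_KEY", "CEREBRAS_MODEL")


def load_cerebras_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Return the three migration values, failing loudly if any is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing settings: " + ", ".join(missing))
    return {
        # Strip a trailing slash so "/chat/completions" can be appended safely.
        "base_url": env["CEREBRAS_BASE_URL"].rstrip("/"),
        "api_key": env["CEREBRAS_API_KEY"],
        "model": env["CEREBRAS_MODEL"],
    }
```

The returned base_url and api_key map directly onto the OpenAI client constructor arguments, and model goes into each chat.completions.create call.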

A curl request to Cerebras chat completions

Once you have your key and the confirmed base URL, a raw curl call is the fastest way to validate your credentials:

export CEREBRAS_API_KEY="your_cerebras_key"
export CEREBRAS_BASE_URL="https://api.cerebras.ai/v1"   # confirm in Cerebras docs

curl "$CEREBRAS_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $CEREBRAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1-8b",
    "messages": [
      {"role": "system", "content": "You are concise."},
      {"role": "user",   "content": "Explain wafer-scale inference in one sentence."}
    ]
  }'

The response is the standard OpenAI shape: a choices array, a message.content string, a finish_reason, and a usage object with prompt_tokens, completion_tokens, and total_tokens. If you see a clean JSON response you're good to move to the SDK.
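For readers who prefer Python to curl, the same request and the response fields listed above can be sketched with only the standard library. This is an illustration under stated assumptions: build_chat_request and extract_reply are hypothetical helper names, and the code builds the request without sending it, so nothing here depends on the real Cerebras host.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a Chat Completions POST request."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # same bearer scheme as the curl call
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_reply(body: dict) -> dict:
    """Pull the commonly used fields out of a decoded response body."""
    choice = body["choices"][0]
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": body["usage"]["total_tokens"],
    }
```

To actually send the request, pass it to urllib.request.urlopen and decode the JSON body before handing it to extract_reply.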

Use Cerebras with the OpenAI Python SDK

The entire migration is two constructor arguments plus the model string. Everything downstream (your message building, streaming loops, response parsing) stays the same:

import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["CEREBRAS_BASE_URL"],  # confirm exact URL in Cerebras docs
    api_key=os.environ["CEREBRAS_API_KEY"],
)

resp = client.chat.completions.create(
    model="llama3.1-8b",   # use a model ID from the Cerebras docs
    messages=[
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user",   "content": "What is wafer-scale compute good for?"},
    ],
)

print(resp.choices[0].message.content)
print(resp.usage)  # prompt_tokens, completion_tokens, total_tokens

Any library that accepts an OpenAI base URL override — LangChain, LlamaIndex, instructor, the Vercel AI SDK — works the same way. They all construct the same Chat Completions HTTP request under the hood.

Streaming with Cerebras

Cerebras supports streaming responses: set stream=True and iterate as you would with any OpenAI-compatible provider. Streaming pairs well with Cerebras's throughput: the first tokens reach the client almost as soon as generation starts, whereas a non-streaming call returns nothing until the whole completion has been generated.

stream = client.chat.completions.create(
    model="llama3.1-8b",
    messages=[{"role": "user", "content": "Summarize the CUDA memory model briefly."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:  # e.g. a usage-only final chunk carries no choices
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

The wire protocol is server-sent events with data: lines and a terminal data: [DONE], identical to OpenAI. Existing streaming parsers work unchanged. To capture token counts from a stream, check whether Cerebras supports stream_options={"include_usage": True} in their docs — that parameter tells compatible providers to include a usage block in the final chunk.
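If you ever need to read the stream without the SDK, the data: framing described above is simple to handle by hand. The sketch below is a hypothetical helper (collect_stream is not part of any SDK) that accumulates delta text and keeps a usage block if one arrives. It assumes the provider mirrors OpenAI's behavior of sending usage in a final chunk whose choices list is empty; whether Cerebras does so is something to confirm in their docs.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_stream(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Accumulate streamed text and, if present, the usage block."""
    parts = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # terminal sentinel: stop reading
        chunk = json.loads(data)
        if chunk.get("usage"):
            usage = chunk["usage"]
        for choice in chunk.get("choices") or []:
            text = (choice.get("delta") or {}).get("content")
            if text:
                parts.append(text)
    return "".join(parts), usage
```

The function takes any iterable of decoded lines, so it can be fed from an HTTP response body or from a recorded fixture in a test.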

JavaScript / TypeScript

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: process.env.CEREBRAS_BASE_URL,  // confirm in Cerebras docs
  apiKey:  process.env.CEREBRAS_API_KEY,
});

const resp = await client.chat.completions.create({
  model: "llama3.1-8b",  // verify current model IDs in Cerebras docs
  messages: [{ role: "user", content: "Write a haiku about fast inference." }],
});

console.log(resp.choices[0].message.content);

What Cerebras's compatibility layer covers

Cerebras's endpoint targets the Chat Completions surface, not every OpenAI product. Here is what to expect and what to verify:

Gotchas when migrating an existing app to Cerebras

The three-value migration is genuinely that simple for apps that only use core chat completions. These are the items that most commonly cause problems:

Why Cerebras + OpenAI SDK is a strong pairing

Cerebras Inference is built on wafer-scale compute that can generate tokens at throughputs significantly higher than GPU-based inference. In practice this means:

The detailed walkthrough of the Cerebras API — keys, pricing tiers, rate limits, model tradeoffs — is in the Cerebras API guide. For the broader picture of why so many providers converge on the same wire format, see OpenAI-compatible API.

Racing and falling back Cerebras behind a gateway

Pointing the OpenAI SDK directly at Cerebras is the right first step. The problem is that it hard-codes a single provider: when Cerebras rate-limits you, a specific model isn't available, or you want to benchmark against another inference backend, you're back to editing application code. A gateway separates provider selection from application logic.

That is what flo2 is built for. flo2 is a developer-first LLM gateway with zero token markup — you bring your own Cerebras key (and keys for OpenAI, Anthropic, Groq, Gemini, DeepInfra, Mistral, xAI, and OpenRouter) and pay each provider directly. A single flo2 key, usable through an OpenAI-compatible and Anthropic-compatible endpoint, routes each request to the provider that is cheapest or fastest for the task, with fallback chains and AI racing so a Cerebras 429 or quota wall rolls over to another provider instead of surfacing as an error to your users.
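To make the fallback-chain idea concrete, here is a minimal client-side sketch in plain Python. It is an illustration only and not how flo2 is implemented: call_with_fallback and AllProvidersFailed are hypothetical names, and each provider is modeled as a simple callable that takes a prompt and returns text or raises.

```python
from typing import Callable, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has failed."""


def call_with_fallback(
    providers: Sequence[Tuple[str, Callable[[str], str]]],
    prompt: str,
) -> Tuple[str, str]:
    """Try providers in order; return (name, reply) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real client would catch only retryable errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Writing and maintaining this logic in every application is the work a gateway takes over, which is why the example that follows needs only one base URL and one key.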

import os
from openai import OpenAI

# One base URL, one key — flo2 routes to Cerebras (or wherever is best)
client = OpenAI(
    base_url="https://flo2.com/v1",
    api_key=os.environ["FLO2_API_KEY"],
)

resp = client.chat.completions.create(
    model="auto",   # or pin to a specific Cerebras model; flo2 falls back for you
    messages=[
        {"role": "user", "content": "Summarize this stack trace."},
    ],
)

print(resp.choices[0].message.content)

Because flo2 exposes the same OpenAI-compatible surface you just used against Cerebras, switching is — again — a base_url and api_key change. You get Cerebras's throughput when it is the best fit, automatic fallback when it is not, AI racing to whoever responds first, A/B testing with an LLM judge, opt-in response caching, and real per-call cost accounting across every provider in one view. Free during Beta. Get Cerebras speed with multi-provider resilience at flo2.
