Use Cerebras with the OpenAI SDK: Compatible API & Base URL
Cerebras Inference is one of the fastest LLM backends available today, and it exposes an OpenAI-compatible endpoint. That means you can point the standard openai Python or JavaScript client straight at Cerebras by changing three values: the base URL, your API key, and the model name. No new SDK, no custom client, no rewrite. This guide covers where the endpoint lives (and where to confirm the exact URL in Cerebras's official docs), walks through curl, Python, and JavaScript examples, explains what the compatibility layer supports, and shows how to pair Cerebras's very fast inference with a fallback gateway so you get speed without single-provider fragility.
Cerebras's OpenAI-compatible endpoint
Cerebras Inference provides a Chat Completions API that follows the OpenAI wire format. The base URL for the compatible endpoint is documented in the Cerebras Inference docs — check the current value there rather than hard-coding anything you read in a blog post, since provider URLs occasionally change. At time of writing the endpoint path follows the pattern https://<cerebras-host>/v1, but confirm the exact host in the official documentation before you ship.
Authentication uses a standard bearer token: your Cerebras API key, created in the Cerebras Cloud console. The Authorization: Bearer <key> header is identical to what the OpenAI SDK sends by default, which is exactly why the compatibility works.
Similarly, model IDs are Cerebras-specific — there is no gpt-4o here. Cerebras hosts its own model catalog; verify the current list and exact model identifiers in the Cerebras Inference docs before you commit a model string to your codebase.
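Since all three values (base URL, key, model ID) should come from the docs rather than from a blog post, it can help to load them in one place and fail early when one is missing. The following is a minimal sketch, not an official Cerebras helper: the names CerebrasSettings, load_settings, auth_headers, and the CEREBRAS_MODEL variable are assumptions made for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CerebrasSettings:
    base_url: str
    api_key: str
    model: str


def load_settings(env=os.environ) -> CerebrasSettings:
    """Read the three migration values from the environment, failing loudly if any is missing."""
    required = ("CEREBRAS_BASE_URL", "CEREBRAS_API_KEY", "CEREBRAS_MODEL")  # names are assumptions
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    base_url = env["CEREBRAS_BASE_URL"].rstrip("/")
    if not base_url.startswith("https://"):
        raise RuntimeError("CEREBRAS_BASE_URL should be an https:// URL")
    return CerebrasSettings(base_url, env["CEREBRAS_API_KEY"], env["CEREBRAS_MODEL"])


def auth_headers(settings: CerebrasSettings) -> dict:
    """Build the bearer-token headers that the OpenAI SDK would otherwise send for you."""
    return {
        "Authorization": f"Bearer {settings.api_key}",
        "Content-Type": "application/json",
    }
```

Keeping the values in one settings object means a changed host or a retired model ID is a configuration update rather than a code change.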
A curl request to Cerebras chat completions
Once you have your key and the confirmed base URL, a raw curl call is the fastest way to validate your credentials:
export CEREBRAS_API_KEY="your_cerebras_key"
export CEREBRAS_BASE_URL="https://api.cerebras.ai/v1" # confirm in Cerebras docs

curl "$CEREBRAS_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $CEREBRAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1-8b",
    "messages": [
      {"role": "system", "content": "You are concise."},
      {"role": "user", "content": "Explain wafer-scale inference in one sentence."}
    ]
  }'
The response is the standard OpenAI shape: a choices array, a message.content string, a finish_reason, and a usage object with prompt_tokens, completion_tokens, and total_tokens. If you see a clean JSON response you're good to move to the SDK.
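To make that shape concrete, here is a small sketch that pulls the interesting fields out of a response body. The SAMPLE payload is invented for illustration (the id, text, and token counts are not real Cerebras output), and summarize is a hypothetical helper name.

```python
import json

# Made-up example body in the standard Chat Completions shape.
SAMPLE = """
{
  "id": "example-id",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "One chip, one wafer."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 21, "completion_tokens": 17, "total_tokens": 38}
}
"""


def summarize(raw: str) -> dict:
    """Extract the reply text, finish reason, and token total from a response body."""
    body = json.loads(raw)
    choice = body["choices"][0]
    usage = body.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": usage.get("total_tokens"),
    }


print(summarize(SAMPLE))
```

If the same function works on the output of your curl call, your existing OpenAI response parsing should not need changes.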
Use Cerebras with the OpenAI Python SDK
The entire migration happens in the constructor arguments. Everything downstream (your message building, streaming loops, response parsing) stays the same:
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["CEREBRAS_BASE_URL"],  # confirm exact URL in Cerebras docs
    api_key=os.environ["CEREBRAS_API_KEY"],
)

resp = client.chat.completions.create(
    model="llama3.1-8b",  # use a model ID from the Cerebras docs
    messages=[
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user", "content": "What is wafer-scale compute good for?"},
    ],
)

print(resp.choices[0].message.content)
print(resp.usage)  # prompt_tokens, completion_tokens, total_tokens
Any library that accepts an OpenAI base URL override — LangChain, LlamaIndex, instructor, the Vercel AI SDK — works the same way. They all construct the same Chat Completions HTTP request under the hood.
Streaming with Cerebras
Cerebras supports streaming responses: set stream=True and iterate as you would with any OpenAI-compatible provider. Streaming does not make generation finish sooner, but it lets you render tokens as they are produced instead of waiting for the full completion. Given Cerebras's throughput, the visible output fills in quickly once the first token arrives.
stream = client.chat.completions.create(
    model="llama3.1-8b",
    messages=[{"role": "user", "content": "Summarize the CUDA memory model briefly."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:  # skip chunks that carry no choices (e.g. a usage-only chunk)
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
The wire protocol is server-sent events with data: lines and a terminal data: [DONE], identical to OpenAI. Existing streaming parsers work unchanged. To capture token counts from a stream, check whether Cerebras supports stream_options={"include_usage": True} in their docs — that parameter tells compatible providers to include a usage block in the final chunk.
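For readers who parse the stream themselves rather than through the SDK, the sketch below shows one way to read OpenAI-style SSE lines. The read_sse function and the SAMPLE_LINES data are hypothetical, and the usage-only final chunk mirrors OpenAI's include_usage behavior; whether Cerebras emits it is something to confirm in their docs.

```python
import json
from typing import Iterable, Optional, Tuple


def read_sse(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Collect streamed text and the usage block (if any) from OpenAI-style SSE lines."""
    parts = []
    usage = None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and SSE comments carry no payload
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # terminal sentinel, not JSON
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                parts.append(text)
    return "".join(parts), usage


# Invented sample stream for illustration only.
SAMPLE_LINES = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":"Hel"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"lo"}}]}',
    "",
    'data: {"choices":[],"usage":{"prompt_tokens":5,"completion_tokens":2,"total_tokens":7}}',
    "",
    "data: [DONE]",
]

text, usage = read_sse(SAMPLE_LINES)
print(text, usage)
```

Note that the usage-only chunk has an empty choices list, which is why the parser iterates over choices instead of indexing the first element.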
JavaScript / TypeScript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: process.env.CEREBRAS_BASE_URL, // confirm in Cerebras docs
  apiKey: process.env.CEREBRAS_API_KEY,
});

const resp = await client.chat.completions.create({
  model: "llama3.1-8b", // verify current model IDs in Cerebras docs
  messages: [{ role: "user", content: "Write a haiku about fast inference." }],
});

console.log(resp.choices[0].message.content);
What Cerebras's compatibility layer covers
Cerebras's endpoint targets the Chat Completions surface, not every OpenAI product. Here is what to expect and what to verify:
- Chat completions: supported. Multi-turn conversations with system, user, and assistant roles work.
- Streaming: supported via stream=True. Verify whether streaming is available for all hosted models in Cerebras's docs.
- Common sampling parameters: temperature, top_p, max_tokens, and stop are generally supported; verify model-level limits in the Cerebras docs.
- Tool / function calling: support is model-dependent and evolves over time. Check the Cerebras docs for which hosted models accept the tools parameter before relying on it in production.
- Structured output / JSON mode: verify in Cerebras docs. response_format: {type: "json_object"} behavior is model-dependent across providers.
- Not available: Assistants API, Responses API, image generation, audio/TTS, and fine-tuning are OpenAI-specific products that do not exist in Cerebras's surface.
Gotchas when migrating an existing app to Cerebras
The three-value migration is genuinely that simple for apps that only use core chat completions. These are the items that most commonly cause problems:
- Model name mismatch. Cerebras uses its own model identifiers. Any routing logic or config keyed on gpt-4o, claude-3-5-sonnet, or other provider-specific strings needs a matching Cerebras model ID. Keep model names in environment variables or config files, not scattered through application code.
- Unsupported parameters. Some fields the OpenAI API accepts (logprobs, n > 1, frequency/presence penalties) may be silently ignored or rejected. Audit the parameters your code sends and test each one against Cerebras explicitly.
- Rate limits. Cerebras enforces its own RPM/TPM limits, separate from OpenAI's. A burst workload that sailed under OpenAI's quotas might hit HTTP 429 against Cerebras. Read the Retry-After response header and back off, or add a fallback provider (more below).
- Confirm base URL in Cerebras docs. Provider infrastructure changes. Do not hard-code a base URL found in a third-party article; read the current value from the Cerebras Inference documentation directly. Store it as an environment variable so you can update it without a code deployment.
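The rate-limit item above can be sketched as a small retry wrapper. This is an illustration under assumptions, not Cerebras's recommended client: retry_after_seconds and call_with_backoff are hypothetical names, only the numeric form of Retry-After is handled, and the wrapper relies on the raised exception exposing status_code and response.headers, as the openai Python client's status errors do.

```python
import time


def retry_after_seconds(headers, default: float) -> float:
    """Return the Retry-After delay in seconds, or `default` if absent or not numeric."""
    lowered = {str(k).lower(): v for k, v in dict(headers or {}).items()}
    try:
        return max(0.0, float(lowered["retry-after"]))
    except (KeyError, TypeError, ValueError):
        return default


def call_with_backoff(request_fn, max_attempts: int = 4, sleep=time.sleep):
    """Call request_fn, retrying on HTTP 429 and honoring Retry-After when present."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception as exc:
            is_429 = getattr(exc, "status_code", None) == 429
            if not is_429 or attempt == max_attempts - 1:
                raise  # not a rate limit, or out of attempts
            response = getattr(exc, "response", None)
            headers = getattr(response, "headers", None)
            # Fall back to exponential backoff (1s, 2s, 4s, ...) without a usable header.
            sleep(retry_after_seconds(headers, default=float(2 ** attempt)))
```

A call would look like call_with_backoff(lambda: client.chat.completions.create(...)). The openai Python client also has a max_retries constructor option with its own retry behavior, so a wrapper like this is mainly useful when you want custom limits or logging.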
Why Cerebras + OpenAI SDK is a strong pairing
Cerebras Inference is built on wafer-scale compute that can generate tokens at throughputs significantly higher than GPU-based inference. In practice this means:
- Low time-to-first-token (TTFT) for latency-sensitive features — chat UIs, real-time summarization, live coding assistants.
- High sustained tokens-per-second for bulk workloads — batch document processing, evaluation pipelines, data extraction.
- Compatible wire format so you do not pay an integration tax — if your stack already speaks OpenAI, you get Cerebras's speed for the cost of three environment variable changes.
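Speed claims are worth checking against your own prompts. The sketch below measures time to first chunk and chunk rate for any iterable of streamed text pieces; measure_stream is a hypothetical helper, and streamed chunks are not guaranteed to map one-to-one to tokens, so treat the rate as an approximation.

```python
import time
from typing import Callable, Iterable


def measure_stream(chunks: Iterable[str], clock: Callable[[], float] = time.perf_counter) -> dict:
    """Return time-to-first-chunk and chunks-per-second for an iterable of text chunks."""
    start = clock()
    first = None
    count = 0
    for text in chunks:
        if not text:
            continue  # ignore empty deltas such as role-only chunks
        now = clock()
        if first is None:
            first = now
        count += 1
    total = clock() - start
    return {
        "ttft_s": None if first is None else first - start,
        "chunks": count,
        "chunks_per_s": count / total if total > 0 else 0.0,
    }
```

With the SDK, you could feed it a generator such as (c.choices[0].delta.content for c in stream if c.choices), creating the stream immediately before the call so that the request latency is included in the measurement.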
The detailed walkthrough of the Cerebras API — keys, pricing tiers, rate limits, model tradeoffs — is in the Cerebras API guide. For the broader picture of why so many providers converge on the same wire format, see OpenAI-compatible API.
Racing and fallback: Cerebras behind a gateway
Pointing the OpenAI SDK directly at Cerebras is the right first step. The problem is that it hard-codes a single provider: when Cerebras rate-limits you, a specific model isn't available, or you want to benchmark against another inference backend, you're back to editing application code. A gateway separates provider selection from application logic.
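To see what provider selection looks like when it lives inside the application, consider this minimal sketch of an in-app fallback chain. The first_success function and the provider names are hypothetical, and real code would catch only retryable errors rather than every exception.

```python
from typing import Callable, Sequence, Tuple


def first_success(providers: Sequence[Tuple[str, Callable[[], str]]]) -> Tuple[str, str]:
    """Try each (name, call) pair in order; return the first result that does not raise."""
    errors = []
    for name, call in providers:
        try:
            return name, call()
        except Exception as exc:  # sketch only: narrow this to retryable errors in practice
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Each callable would wrap a client configured for one provider. Every new provider or policy change means editing and redeploying this code, which is the coupling a gateway removes.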
That is what flo2 is built for. flo2 is a developer-first LLM gateway with zero token markup — you bring your own Cerebras key (and keys for OpenAI, Anthropic, Groq, Gemini, DeepInfra, Mistral, xAI, and OpenRouter) and pay each provider directly. A single flo2 key, usable through an OpenAI-compatible and Anthropic-compatible endpoint, routes each request to the provider that is cheapest or fastest for the task, with fallback chains and AI racing so a Cerebras 429 or quota wall rolls over to another provider instead of surfacing as an error to your users.
import os
from openai import OpenAI

# One base URL, one key: flo2 routes to Cerebras (or wherever is best)
client = OpenAI(
    base_url="https://flo2.com/v1",
    api_key=os.environ["FLO2_API_KEY"],
)

resp = client.chat.completions.create(
    model="auto",  # or pin to a specific Cerebras model; flo2 falls back for you
    messages=[
        {"role": "user", "content": "Summarize this stack trace."},
    ],
)

print(resp.choices[0].message.content)
Because flo2 exposes the same OpenAI-compatible surface you just used against Cerebras, switching is — again — a base_url and api_key change. You get Cerebras's throughput when it is the best fit, automatic fallback when it is not, AI racing to whoever responds first, A/B testing with an LLM judge, opt-in response caching, and real per-call cost accounting across every provider in one view. Free during Beta. Get Cerebras speed with multi-provider resilience at flo2.