Cerebras Inference API Guide: Speed, Models, Pricing & Setup
When a model answer paints onto the screen so fast it feels pre-rendered, the hardware underneath is doing something unusual. That is the pitch of the Cerebras API: Cerebras Inference runs open-weight models on wafer-scale silicon engineered to emit tokens at speeds general-purpose GPU endpoints rarely touch. Cerebras isn't a model lab — it's an inference platform that serves popular open-weight models (the Llama family among them) at very high tokens-per-second. For developers building anything latency-sensitive — autocomplete, voice agents, real-time chat, agent loops, high-volume pipelines — that throughput is the whole reason to reach for it. This guide covers what Cerebras Inference is, how to get a key and make your first call, what it is genuinely good at, and how to think about pricing and limits without getting burned when you scale.
One ground rule up front: exact model names, prices, and rate limits are deliberately left to Cerebras's own docs. That lineup moves fast, and a number that's correct today can be stale next quarter. Treat the official Cerebras models, pricing, and limits pages as the source of truth — this guide gives you the shape, not hard-coded figures.
What is the Cerebras Inference API?
Cerebras runs inference on the Wafer-Scale Engine, a single, enormous chip rather than a rack of discrete GPUs wired together. The bet is that keeping the weights in on-chip memory and the whole generation step on one piece of silicon removes the off-chip memory traffic and inter-chip communication overhead that throttle token-by-token decoding elsewhere. The upshot for you is very high tokens per second and low time-to-first-token on the models Cerebras Cloud hosts. You don't manage any hardware; you call a hosted API and get the speed.
What Cerebras serves is open-weight models — the Llama family is a headline draw (hence the steady stream of searches for a Cerebras Llama endpoint), alongside other open models that come and go as the ecosystem shifts. Cerebras doesn't train these models; it makes them fast. Because the exact catalog changes regularly, don't hard-code a model assumption from a blog post — check Cerebras's current models page for what's live, its context window, and its status before you wire it into anything.
The other thing that makes Cerebras easy to adopt: the Cerebras Inference API exposes an OpenAI-compatible surface. If your code already speaks to /v1/chat/completions, moving a call to Cerebras is mostly a base URL and an API key swap — no new SDK, no rewrite.
Getting started with the Cerebras API
Three things stand between you and your first token: get a key, point your client at the Cerebras endpoint, and name a model.
Create a Cerebras API key
Sign up for Cerebras Cloud and generate a Cerebras API key from the API keys section of the developer dashboard. Treat it like any other secret — load it from an environment variable, never commit it to source control, and rotate it if it leaks. The examples below read it from CEREBRAS_API_KEY.
export CEREBRAS_API_KEY="csk_your_key_here"
The OpenAI-compatible base URL
Cerebras serves its OpenAI-compatible endpoint under a versioned /v1 path. The exact host can change, so confirm the current base URL in the Cerebras docs rather than copying one from a third-party post — then point any OpenAI-style client at it, pass your Cerebras key as the bearer token, and set model to a current Cerebras model ID. Here's the shape of a minimal curl call against chat completions:
curl "<cerebras-base-url>/chat/completions" \
  -H "Authorization: Bearer $CEREBRAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<current-cerebras-model-id>",
    "messages": [
      {"role": "user", "content": "Explain wafer-scale inference in one sentence."}
    ]
  }'
Replace <cerebras-base-url> with the verified base URL from the docs and <current-cerebras-model-id> with a real ID from Cerebras's models page — for example a current Llama variant. Request and response shapes mirror the OpenAI Chat Completions API, so fields like temperature, max_tokens, and stream behave as you'd expect.
Cerebras with the Python openai client
Because the API is OpenAI-compatible, the official openai Python package talks to Cerebras with two overrides — base_url and api_key. No Cerebras-specific library is required, which is exactly why "Cerebras Python" usually just means "the OpenAI SDK pointed at Cerebras."
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ["CEREBRAS_BASE_URL"],  # the verified Cerebras /v1 URL
    api_key=os.environ["CEREBRAS_API_KEY"],
)

resp = client.chat.completions.create(
    model="<current-cerebras-model-id>",  # from Cerebras's models page
    messages=[
        {"role": "user", "content": "Give me three uses for fast inference."}
    ],
)
print(resp.choices[0].message.content)
Streaming works identically — pass stream=True and iterate the chunks. On Cerebras the payoff is especially visible, since a high token rate paints a streamed answer in a fraction of the wall-clock time a slower host needs. Cerebras also publishes a native SDK, but the OpenAI-compatible path is usually the fastest way to drop it into an existing codebase.
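stream = client.chat.completions.create(
    model="<current-cerebras-model-id>",
    messages=[{"role": "user", "content": "Write a haiku about speed."}],
    stream=True,
)
for chunk in stream:
    # Some OpenAI-compatible servers send chunks with no choices; skip those.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()  # finish the line once the stream ends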
What Cerebras is great at
Cerebras's advantage is narrow and deep: it's about speed, not about hosting every model or having the absolute lowest per-token price on every SKU. Lean into it where speed is the product.
- Latency-critical, user-facing paths. Autocomplete, inline suggestions, voice assistants, and live chat all live or die on time-to-first-token and how fast the rest streams. Very high tokens/sec is the difference between "instant" and "laggy" here.
- High-throughput batch work. Classification, extraction, summarization, and evaluation over large datasets finish sooner when each call returns faster — the gains compound across thousands of requests.
- Agent loops and tool-calling chains. When a single user action triggers several sequential model calls, per-call latency stacks up. A fast host keeps multi-step agents feeling responsive instead of sluggish.
- Reasoning models that emit many tokens. Models that "think" out loud generate long intermediate chains; on slow hardware that thinking is a visible wait, while a fast host can make even verbose reasoning feel near-instant. Open-weight, Llama-class stacks fit here too — keep the models you already use and simply run them faster.
One honest caveat: speed claims are workload-dependent. Tokens-per-second and time-to-first-token (TTFT) shift with prompt shape, output length, region, and load, so benchmark Cerebras from your own environment with your own prompts rather than trusting any headline number. For the broader playbook on where milliseconds hide, see our guide to reducing LLM latency.
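Here is a minimal sketch of that kind of benchmark, assuming you feed it the text pieces of any streamed response. The name measure_stream and the injectable clock are illustrative, not part of any Cerebras or OpenAI SDK. It counts streamed pieces rather than true tokens, so treat the rate as a rough proxy.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(chunks: Iterable[Optional[str]],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time a stream of text pieces: TTFT, total time, and pieces per second."""
    start = clock()
    first = None  # timestamp of the first non-empty piece
    pieces = 0
    for text in chunks:
        if not text:  # skip empty or None deltas
            continue
        if first is None:
            first = clock()
        pieces += 1
    end = clock()
    total = end - start
    return {
        "ttft_s": None if first is None else first - start,
        "total_s": total,
        "pieces": pieces,
        # Streamed pieces are not guaranteed to be single tokens.
        "pieces_per_s": pieces / total if total > 0 else None,
    }
```

With the streaming example above, you could pass a generator such as (c.choices[0].delta.content for c in stream if c.choices), then compare the numbers across hosts, regions, and prompt shapes.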
Cerebras pricing and rate limits
Cerebras's commercial model is conventional per-token pricing: you pay separately for input and output tokens, and the rate varies by model — larger models cost more per token than smaller ones. There's also typically a free or evaluation tier that lets you build and test without committing, which is part of why Cerebras shows up in "fast free inference" discussions.
The free tier and the paid tiers alike come with rate limits, usually expressed as some combination of:
- RPM — requests per minute.
- TPM — tokens per minute.
- RPD / TPD — daily request and token ceilings on some tiers.
Cross any of those and the API returns an HTTP 429, the standard "too many requests" signal. Limits differ per model and per account tier, and they change as Cerebras adjusts capacity — so the only reliable figures are on Cerebras's official pricing and limits pages. Don't bake a specific RPM, TPM, or per-token price into your assumptions from a third-party source; verify the current numbers before you plan capacity.
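One way to stay under an RPM or TPM ceiling is to track your own usage on the client side. The sketch below assumes a simple 60-second sliding window, which may not match how Cerebras actually counts. The class name MinuteBudget is hypothetical, and the rpm and tpm values you pass in must come from the official limits page. Token counts here are your own estimates (prompt plus expected output), not figures from the API.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class MinuteBudget:
    """Client-side sliding-window guard for RPM and TPM ceilings (illustrative)."""

    def __init__(self, rpm: int, tpm: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if rpm < 1 or tpm < 1:
            raise ValueError("rpm and tpm must be positive")
        self.rpm = rpm
        self.tpm = tpm
        self.clock = clock
        self.events: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)

    def _prune(self, now: float) -> None:
        # Drop entries that have left the 60-second window.
        while self.events and now - self.events[0][0] >= 60.0:
            self.events.popleft()

    def wait_time(self, tokens: int) -> float:
        """Seconds until a request of `tokens` fits; 0.0 if it fits now."""
        if tokens > self.tpm:
            raise ValueError("request is larger than the per-minute token budget")
        now = self.clock()
        self._prune(now)
        count = len(self.events)
        used = sum(t for _, t in self.events)
        if count < self.rpm and used + tokens <= self.tpm:
            return 0.0
        # Walk through upcoming expirations until the request would fit.
        for ts, t in self.events:
            count -= 1
            used -= t
            if count < self.rpm and used + tokens <= self.tpm:
                return max(0.0, ts + 60.0 - now)
        return 0.0  # not reached given the checks above

    def record(self, tokens: int) -> None:
        """Note that a request estimated at `tokens` was just sent."""
        self.events.append((self.clock(), tokens))
```

A worker would call wait_time before each request, sleep for the returned duration, then call record once the request is sent.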
The practical implication for anything beyond a demo: plan for the limit, not around it. Respect Retry-After when you get a 429, back off with jitter, and queue work so a burst doesn't slam the ceiling.
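The retry advice above can be sketched as follows. This is an illustration under stated assumptions, not Cerebras's or the OpenAI SDK's own retry logic: RateLimited is a hypothetical exception your HTTP layer would raise on a 429, carrying the Retry-After value in seconds when the server sent one, and the base and cap defaults are arbitrary.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error for an HTTP 429; holds Retry-After seconds if known."""

    def __init__(self, retry_after: Optional[float] = None) -> None:
        super().__init__("rate limited (HTTP 429)")
        self.retry_after = retry_after


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 0.5, cap: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Delay before the next try; `attempt` is 0 for the first retry."""
    if retry_after is not None:
        return max(0.0, retry_after)  # the server's hint wins
    # Full jitter: a random delay up to min(cap, base * 2**attempt).
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(fn: Callable[[], T], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    """Call fn, retrying on RateLimited until max_attempts is used up."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller decide
            sleep(backoff_delay(attempt, err.retry_after, rng=rng))
    raise AssertionError("unreachable")
```

The jitter matters because many clients that hit the limit together would otherwise retry together and hit it again. If you use the openai package, you would translate its rate-limit error into RateLimited yourself; that mapping is left out here.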
Staying reliable under Cerebras's rate limits
Here's the tension every Cerebras user eventually hits: it's exceptionally fast, but its rate limits are real, and a traffic spike that pushes you past TPM will start returning 429s exactly when you can least afford it. Hard-wiring your app to a single Cerebras endpoint makes that limit your app's limit — and the irony is that the fastest provider becomes a hard ceiling the moment demand outruns your allocation.
The clean fix is to put Cerebras behind an LLM gateway with automatic fallback. Instead of calling Cerebras directly, your code calls one stable endpoint; the gateway routes to Cerebras for its speed and, the moment Cerebras returns a 429 (or slows past your tolerance), transparently retries the same request on another provider — another host of the same open-weight model, or a different model entirely. Your users never see the rate limit; they just see a response.
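To make the fallback idea concrete, here is a toy sketch of the routing logic, assuming each provider is wrapped in a plain function that takes a prompt and returns text. ProviderUnavailable and complete_with_fallback are hypothetical names, and this is not how any particular gateway implements it; a real gateway also handles timeouts, streaming, and per-provider model names.

```python
from typing import Callable, List, Sequence, Tuple


class ProviderUnavailable(Exception):
    """Hypothetical error: the provider returned a 429 or was too slow."""


def complete_with_fallback(
    providers: Sequence[Tuple[str, Callable[[str], str]]],
    prompt: str,
) -> Tuple[str, str]:
    """Try providers in order; return (name, answer) from the first that works."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderUnavailable as err:
            # Remember why this provider was skipped, then try the next one.
            errors.append(f"{name}: {err}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Listing Cerebras first keeps it as the fast path, and returning the provider name alongside the answer lets you log how often the fallback actually fires.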
That pattern gives you Cerebras's throughput as the fast path and a safety net for the tail, without scattering provider-specific retry logic across your services. For latency-critical apps especially, fallback is what turns "fast most of the time" into "fast and dependable."
flo2 is a developer-first, bring-your-own-key LLM gateway built for exactly this. You add your own Cerebras key (plus OpenAI, Anthropic, Gemini, Groq, DeepInfra, Mistral, xAI, OpenRouter, and more) and pay each provider directly with zero markup on tokens — flo2 doesn't take a per-token cut, which makes it a genuine zero-markup OpenRouter alternative. One OpenAI- and Anthropic-compatible key routes each request to the cheapest or fastest model and falls back automatically when a provider is rate-limited, so Cerebras's speed stays a feature instead of a single point of failure. It's free during Beta, so you can wire Cerebras in behind a fallback and start measuring today.