2026-06-03 · flo2 blog

xAI Grok API Guide: Models, Pricing & OpenAI-Compatible Setup

If you want to call xAI's Grok models from your own code, the good news is that there's very little new to learn. The Grok API exposes an OpenAI-compatible surface, so the same request shapes, SDKs, and patterns you already use for /v1/chat/completions work against it with a base URL and key swap. This guide walks through the practical parts a developer actually needs: what the Grok model family is generally known for, how to get a Grok API key from the xAI console, how to make your first call against the https://api.x.ai/v1 endpoint with both curl and the Python openai client, how to think about Grok API pricing at a high level, and how to route Grok behind a gateway alongside other providers.

One ground rule first: xAI's model lineup, context limits, and prices move quickly — IDs get added, renamed, and redirected on a regular cadence. So this article avoids hard-coding specific model IDs, token counts, or dollar figures. Treat xAI's official models and pricing pages (docs.x.ai) as the single source of truth, and use the patterns here as the durable part that won't rot.

What is the Grok API?

Grok is the family of large language models built by xAI. The xAI API (often searched as the "xai api" or "x ai api") is the hosted interface that lets your application send prompts to those models and get completions back over HTTP, with no model hosting and no GPUs to manage on your side. You authenticate with a bearer token, name a model, and send messages.

The detail that makes adoption painless is compatibility. The Grok API is OpenAI-compatible: requests and responses mirror the OpenAI Chat Completions format, which is why so many developers search specifically for "grok openai compatible." If your codebase already speaks the OpenAI dialect, pointing it at Grok is mostly a configuration change. If you're new to that idea, our explainer on the OpenAI-compatible API pattern covers why so many providers converge on the same schema and what that buys you.

What Grok is generally known for

Keeping this factual rather than promotional: the Grok line is typically positioned around a few characteristics. Verify the current specifics for any given model on xAI's docs, but in broad strokes the family is associated with:

Large context windows. Recent Grok models support very large context, which suits long documents, big codebases, and multi-turn agent transcripts. The exact window size differs per model — check the models page for the figure on the variant you pick.
Strong reasoning. xAI ships reasoning-oriented variants that "think" before answering, aimed at multi-step problems, math, and code. Some variants offer both a reasoning-heavy mode and a faster non-reasoning mode; the right one depends on whether you're optimizing for depth or latency.
Agentic tool calling. Several Grok variants are tuned for function/tool calling, which matters if you're building agents that invoke external tools in a loop.
Multimodal input on some models. Vision (image) input is available on certain Grok models. Confirm format and size constraints in the docs before relying on it.

Because xAI redirects and deprecates model IDs on a schedule, don't bake a specific Grok model name from a blog post into production. Read the current models page, pick an ID that's live, and note its context window and capabilities there.
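One way to keep model names out of your code is to hold an ordered preference list in config and resolve it against whatever is live at startup. The sketch below assumes you already have the list of live IDs (copied from the models page, or fetched with the OpenAI client's models.list() call if the endpoint supports it). The function name and the example IDs are hypothetical placeholders, not real Grok model names.

```python
from typing import Iterable, Sequence


def pick_live_model(preferred: Sequence[str], live_ids: Iterable[str]) -> str:
    """Return the first preferred model ID that is currently live."""
    live = set(live_ids)
    for model_id in preferred:
        if model_id in live:
            return model_id
    raise LookupError(f"none of {list(preferred)} is live; update your config")


# Hypothetical IDs for illustration only; real ones come from xAI's models page.
live = ["example-grok-large", "example-grok-fast"]
print(pick_live_model(["example-grok-retired", "example-grok-fast"], live))
```

Failing loudly when nothing matches is deliberate: a startup error is easier to diagnose than a request that silently targets a retired ID.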

Getting a Grok API key in the xAI console

Three steps stand between you and your first token: create a key, point your client at the endpoint, and name a model.

To get a Grok API key, sign in to the xAI developer console, open the API keys section, and create a new key. xAI's API is billed separately from any consumer Grok subscription, so you may need to set up billing or credits before keys will produce completions — check the console's billing area. Treat the key like any other secret: load it from an environment variable, never commit it, and rotate it if it leaks. The examples below read it from XAI_API_KEY.

export XAI_API_KEY="xai-your-key-here"
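On the Python side, a small loader can turn a missing variable into a clear error instead of a confusing authentication failure later. This is a minimal sketch; load_xai_key and redact are illustrative helper names, not part of any SDK.

```python
import os
from typing import Mapping


def load_xai_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the API key from XAI_API_KEY, or fail with a clear message."""
    key = env.get("XAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("XAI_API_KEY is not set; export it before running")
    return key


def redact(key: str) -> str:
    """Mask all but the last four characters, for safer logging."""
    return "*" * max(len(key) - 4, 0) + key[-4:]
```

Logging redact(key) rather than the key itself lets you confirm which key is loaded without leaking it.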

The OpenAI-compatible base URL

Grok's OpenAI-compatible endpoint lives under a versioned path at https://api.x.ai/v1. Point any OpenAI-style client at that base URL, pass your xAI key as the bearer token, and set model to a current Grok model ID. Here's the shape of a minimal curl call against chat completions:

curl https://api.x.ai/v1/chat/completions \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<current-grok-model-id>",
    "messages": [
      {"role": "system", "content": "You are a concise assistant."},
      {"role": "user", "content": "Explain what an LLM gateway does in one sentence."}
    ]
  }'

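Replace <current-grok-model-id> with a real ID from xAI's models page. Because the surface follows the OpenAI Chat Completions schema, familiar fields like temperature, max_tokens, stream, and tools generally behave the way you'd expect. Support for individual parameters can differ between models, particularly reasoning variants, so confirm the ones you depend on in xAI's API reference.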
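For readers who want to see the same request without curl or an SDK, here is a sketch that builds it with Python's standard library. It only constructs the request and does not send it; build_chat_request is a hypothetical helper name, and the optional fields shown are the OpenAI-style ones mentioned above.

```python
import json
import urllib.request

XAI_CHAT_URL = "https://api.x.ai/v1/chat/completions"


def build_chat_request(api_key, model, messages, **options):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {"model": model, "messages": messages}
    # Only include optional fields the caller actually set.
    payload.update({k: v for k, v in options.items() if v is not None})
    return urllib.request.Request(
        XAI_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    "xai-your-key-here",
    "<current-grok-model-id>",  # placeholder; use a live ID from xAI's models page
    [{"role": "user", "content": "Say hello."}],
    temperature=0.2,
    max_tokens=64,
)
print(req.get_method(), req.full_url)
print(json.loads(req.data))
```

Passing the result to urllib.request.urlopen would send it; keeping construction separate makes the payload easy to inspect and test offline.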

Calling Grok with the Python openai client

Since the API is OpenAI-compatible, the official openai Python package talks to Grok with just two overrides — base_url and api_key. No xAI-specific SDK is required, which is exactly why "x ai api" tutorials so often just reuse the OpenAI SDK.

import os

from openai import OpenAI

# Point the standard OpenAI client at xAI's endpoint.
client = OpenAI(
    base_url="https://api.x.ai/v1",
    api_key=os.environ["XAI_API_KEY"],  # raises KeyError if the variable is unset
)

resp = client.chat.completions.create(
    model="<current-grok-model-id>",   # from xAI's models page
    messages=[
        {"role": "user", "content": "Give me three uses for a long context window."}
    ],
)

# The reply text is on the first choice's message.
print(resp.choices[0].message.content)

Streaming follows the same pattern: pass stream=True and iterate over the chunks.

stream = client.chat.completions.create(
    model="<current-grok-model-id>",
    messages=[{"role": "user", "content": "Write a haiku about debugging."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:  # some chunks may carry no choices; skip them
        continue
    delta = chunk.choices[0].delta.content
    if delta:  # content can be None on role-only or final chunks
        print(delta, end="", flush=True)
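If you need the full text after streaming (for logging or caching, say), you can accumulate the deltas as they arrive. The sketch below uses stand-in chunk objects so it runs offline; collect_stream and fake_chunk are hypothetical helpers, and the chunk shape is assumed to match the OpenAI streaming format used above.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the text deltas of a chat-completions stream into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # skip None or empty deltas
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for an SDK chunk object, for offline testing only."""
    choice = SimpleNamespace(delta=SimpleNamespace(content=text))
    return SimpleNamespace(choices=[choice])


demo = [
    fake_chunk(None),
    fake_chunk("Hello"),
    SimpleNamespace(choices=[]),
    fake_chunk(", world"),
]
print(collect_stream(demo))
```

With a real client you would pass the stream object returned by chat.completions.create in place of demo.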

If you're using a framework that already targets OpenAI — LangChain, LlamaIndex, the Vercel AI SDK, or your own thin wrapper — you typically configure the same two values (base URL and key) and keep the rest of your code untouched.
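To make the "two values" idea concrete, here is a sketch that keeps the base URL and key lookup in one table so switching providers is a one-word change. The table layout and the client_kwargs name are assumptions for illustration; the xAI URL is the one from this guide and the OpenAI URL is that SDK's default.

```python
import os
from typing import Mapping

# Base URL and key variable per provider (illustrative config table).
BASE_URLS = {
    "xai": "https://api.x.ai/v1",
    "openai": "https://api.openai.com/v1",
}
KEY_ENV_VARS = {
    "xai": "XAI_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def client_kwargs(provider: str, env: Mapping[str, str] = os.environ) -> dict:
    """Return the two settings an OpenAI-style client needs for a provider."""
    return {
        "base_url": BASE_URLS[provider],
        "api_key": env[KEY_ENV_VARS[provider]],
    }


# Usage with the openai package would look like:
#   client = OpenAI(**client_kwargs("xai"))
```

Everything downstream of client construction stays the same, which is the practical payoff of the shared schema.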

Grok API pricing, conceptually

xAI uses conventional per-token pricing: you pay separately for input (prompt) tokens and output (completion) tokens, and the rate varies by model — more capable or larger-context models generally cost more per token than lighter ones. Reasoning variants can also consume more output tokens than they appear to, because the model's internal "thinking" counts toward billed output on many such models. That's worth modeling if you run reasoning-heavy prompts at volume.
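The arithmetic is simple enough to sketch. The function below assumes rates are quoted per million tokens and that reasoning tokens are billed at the output rate and reported separately from the visible output; check both assumptions against xAI's pricing page and the usage fields in real responses. The rates in the example are made-up placeholders, not xAI prices.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_m: float,
    output_rate_per_m: float,
    reasoning_tokens: int = 0,
) -> float:
    """Estimate the cost of one call, with rates in currency units per million tokens."""
    # Assumption: hidden reasoning tokens are billed like output tokens.
    billed_output = output_tokens + reasoning_tokens
    total = input_tokens * input_rate_per_m + billed_output * output_rate_per_m
    return total / 1_000_000


# Placeholder rates for illustration only (1.00 in, 4.00 out per million tokens).
visible_only = estimate_cost(2_000, 500, 1.00, 4.00)
with_thinking = estimate_cost(2_000, 500, 1.00, 4.00, reasoning_tokens=1_500)
print(f"{visible_only:.4f} vs {with_thinking:.4f}")
```

In this example the same visible answer costs 0.0040 without reasoning tokens and 0.0100 with them, which is why reasoning-heavy workloads deserve their own line in a budget.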

Beyond the base rates, a few cost levers are worth knowing about generically (confirm availability and terms on xAI's docs, since these features and their discounts change):

Prompt/context caching. Providers increasingly offer reduced pricing for repeated prompt prefixes. If Grok exposes this and your prompts share a large stable preamble, it can cut input costs meaningfully.
Variant choice. Picking a faster, cheaper variant for easy requests and reserving a heavier reasoning model for hard ones is the single biggest lever on spend.
Output discipline. Tight max_tokens and prompts that discourage rambling keep the billed-output side in check.
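To see why variant choice dominates, it helps to compute a blended per-request cost from the share of traffic that goes to the heavier model. This is a back-of-the-envelope sketch; the function name and the per-request costs are hypothetical placeholders, not measured figures.

```python
def blended_cost_per_request(light_cost: float, heavy_cost: float, heavy_share: float) -> float:
    """Average cost per request when heavy_share of traffic uses the heavy model."""
    if not 0.0 <= heavy_share <= 1.0:
        raise ValueError("heavy_share must be between 0 and 1")
    return heavy_share * heavy_cost + (1.0 - heavy_share) * light_cost


# Placeholder per-request costs for illustration only.
all_heavy = blended_cost_per_request(0.001, 0.02, 1.0)
mostly_light = blended_cost_per_request(0.001, 0.02, 0.1)
print(f"{all_heavy:.4f} vs {mostly_light:.4f}")
```

With these placeholder numbers, sending only 10% of requests to the heavy model drops the average from 0.0200 to 0.0029 per request.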

For exact numbers — the only ones you should plan capacity against — read xAI's official Grok API pricing page. Don't trust a per-token figure quoted in a third-party post (including this one); verify it at the source before you commit a budget.

Routing Grok behind a gateway alongside other providers

Calling Grok directly is fine for a prototype. In production, hard-wiring your app to a single endpoint creates the usual problems: a rate-limit spike returns 429s at the worst moment, a provider incident takes your feature down, and comparing Grok against other labs means scattering provider-specific clients and keys across your services.

The cleaner architecture is to put Grok behind an LLM gateway. Your code calls one stable endpoint; the gateway owns provider keys, routing, retries, and fallback. Because Grok is already OpenAI-compatible, it slots into a gateway that speaks the same dialect without special-casing. With that layer in place you can:

Fail over automatically. If Grok returns a 429 or an error, the gateway transparently retries the same request on another model — so users see a response, not an outage.
Route by cost or speed. Send cheap/easy requests to a lighter model and hard ones to a Grok reasoning variant, without branching logic in your app.
Compare fairly. Run the same task across Grok and competing models and judge which actually fits the job, instead of guessing from marketing.
Account for spend per call. A gateway can attribute true per-call cost across providers, which is hard to reconstruct from separate provider invoices.
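The failover behavior in the list above can be sketched in a few lines. This is an illustration of the general pattern, not how any particular gateway implements it: the backends are plain callables, and RateLimited is a stand-in exception for an HTTP 429.

```python
class RateLimited(Exception):
    """Stand-in for a provider's HTTP 429 response."""


def call_with_fallback(backends, request):
    """Try each (name, callable) backend in order; return the first success."""
    failed = []
    for name, call in backends:
        try:
            return name, call(request)
        except RateLimited:
            failed.append(name)  # remember the failure and try the next backend
    raise RuntimeError(f"all backends failed: {failed}")


def grok_backend(request):
    """Simulated primary that is currently rate-limited."""
    raise RateLimited("429 from primary")


def other_backend(request):
    """Simulated secondary that answers normally."""
    return f"answer to: {request}"


result = call_with_fallback([("grok", grok_backend), ("other", other_backend)], "ping")
print(result)
```

Returning the backend name alongside the response matters for the cost-accounting point: you need to know which model actually served the request to attribute its cost correctly.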

flo2 is a developer-first, bring-your-own-key LLM gateway built for exactly this. You add your own xAI key (plus OpenAI, Anthropic, Gemini, Groq, Cerebras, DeepInfra, Mistral, and OpenRouter) and pay each provider directly with zero markup on tokens — flo2 doesn't take a per-token cut, which makes it a genuine zero-markup OpenRouter alternative. One OpenAI- and Anthropic-compatible key routes each request to the cheapest or fastest model, falls back automatically when a provider is rate-limited, and can race or A/B models with an LLM judge so you can measure model–task fit rather than guess. It's free during Beta, so you can wire Grok in behind smart routing and fallback and start measuring today.
