2026-06-03 · flo2 blog

Groq API Guide (2026): Setup, Speed, Pricing & Rate Limits

If you've ever watched a response stream in so fast it feels like the text was already written, there's a good chance it ran on the Groq API. Groq isn't another model lab — it's an inference platform that serves popular open-weight models (Llama and others) on custom hardware engineered to emit tokens at speeds that general-purpose GPU endpoints rarely match. For developers building anything latency-sensitive — autocomplete, voice agents, real-time chat, high-volume pipelines — that raw throughput is the whole pitch. This guide covers what Groq is, how to get a key and make your first call, what it's genuinely good at, and how to think about pricing and rate limits without getting burned when you scale.

Throughout, exact model names, prices, and limits are deliberately left for Groq's own docs — that lineup moves fast, and a number that's right today can be stale next quarter. Treat the official pricing, models, and limits pages as the source of truth.

What is Groq?

Groq runs inference on its own custom silicon — the LPU (Language Processing Unit), an architecture purpose-built for the sequential, one-token-at-a-time nature of language model generation rather than the batch-parallel workloads GPUs were designed for. The practical upshot is very high tokens per second and low time-to-first-token on the models it hosts. You don't manage any of that hardware; you call a hosted API and get the speed.

What Groq serves is open-weight models — the Llama family is the headline draw (hence the steady stream of searches for a Groq Llama API), alongside other open models that come and go as the ecosystem shifts. Groq doesn't train these models; it makes them fast. Because the exact catalog changes regularly, don't hard-code a model assumption from a blog post — check Groq's current models page for what's live, its context window, and its status before you wire it into anything.

The other thing that makes Groq easy to adopt: its API is OpenAI-compatible. If your code already speaks to /v1/chat/completions, moving a call to Groq is mostly a base URL and an API key swap — no new SDK, no rewrite.

Getting started with the Groq API

Three steps stand between you and your first token: get a key, point your client at Groq's endpoint, and name a model.

Create a Groq API key

Sign up on Groq's console and generate a Groq API key from the API keys section of the dashboard. Treat it like any other secret — load it from an environment variable, never commit it to source control, and rotate it if it leaks. The examples below read it from GROQ_API_KEY.

export GROQ_API_KEY="gsk_your_key_here"
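On the Python side, a small helper can fail fast when the variable is missing instead of sending an empty bearer token. This is a minimal sketch; the helper name load_groq_key is made up for illustration and is not part of any SDK.

```python
import os


def load_groq_key(env=os.environ):
    """Return the Groq API key from the environment, or raise a clear error.

    `env` defaults to the real process environment; passing a plain dict
    makes the helper easy to exercise in tests.
    """
    key = env.get("GROQ_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set; export it before running.")
    return key
```

Calling load_groq_key() once at startup surfaces a missing key immediately rather than as an authentication error on the first request.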

The OpenAI-compatible base URL

Groq exposes an OpenAI-compatible surface at:

https://api.groq.com/openai/v1

Point any OpenAI-style client at that base URL, pass your Groq key as the bearer token, and set model to a current Groq model ID. Here's a minimal curl call against chat completions:

curl https://api.groq.com/openai/v1/chat/completions \
  -H "Authorization: Bearer $GROQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<current-groq-model-id>",
    "messages": [
      {"role": "user", "content": "Explain an LPU in one sentence."}
    ]
  }'

Replace <current-groq-model-id> with a real ID from Groq's models page — for example a current Llama variant. Request and response shapes mirror the OpenAI Chat Completions API, so fields like temperature, max_tokens, and stream behave as you'd expect.

Groq with the Python openai client

Because the API is OpenAI-compatible, the official openai Python package talks to Groq with two overrides — base_url and api_key. No Groq-specific library required, which is exactly why "Groq Python" usually just means "the OpenAI SDK pointed at Groq."

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

resp = client.chat.completions.create(
    model="<current-groq-model-id>",   # from Groq's models page
    messages=[
        {"role": "user", "content": "Give me three uses for fast inference."}
    ],
)

print(resp.choices[0].message.content)

Streaming works identically — pass stream=True and iterate the chunks. On Groq the payoff is especially visible, since a high token rate paints a streamed answer in a fraction of the wall-clock time a slower host needs.

stream = client.chat.completions.create(
    model="<current-groq-model-id>",
    messages=[{"role": "user", "content": "Write a haiku about speed."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
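Both examples leave the model ID as a placeholder. One way to avoid hard-coding a stale ID is to read it from configuration and confirm it is still listed before use. The sketch below assumes Groq's OpenAI-compatible surface also serves the standard model-listing call (client.models.list() in the openai package); verify that against Groq's API reference. The GROQ_MODEL variable name and both helper names are hypothetical.

```python
import os


def live_model_ids(client):
    """Return the model IDs the provider currently lists (one network call).

    Assumes `client` is an OpenAI-style client whose models.list() yields
    objects with an `id` attribute.
    """
    return [m.id for m in client.models.list()]


def resolve_model_id(available, env=os.environ, var="GROQ_MODEL"):
    """Read the model ID from configuration and confirm it is still listed."""
    wanted = env.get(var, "").strip()
    if not wanted:
        raise RuntimeError(f"{var} is not set; copy an ID from Groq's models page.")
    if wanted not in available:
        raise RuntimeError(f"{wanted!r} is not in the current model list.")
    return wanted
```

With the client from the examples above, model = resolve_model_id(live_model_ids(client)) would run once at startup, so a retired model shows up as a clear error instead of a failed request in production.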

What Groq is great at

Groq's advantage is narrow and deep: it's about speed, not about hosting every model or having the absolute lowest per-token price on every SKU. Lean into it where speed is the product.

Latency-critical, user-facing paths. Autocomplete, inline suggestions, voice assistants, and live chat all live or die on time-to-first-token and how fast the rest streams. High tokens/sec is the difference between "instant" and "laggy" here.
High-throughput batch work. Classification, extraction, summarization, and evaluation over large datasets finish sooner when each call returns faster — the gains compound across thousands of requests.
Agent loops and tool-calling chains. When a single user action triggers several sequential model calls, per-call latency stacks up. A fast host keeps multi-step agents feeling responsive instead of sluggish.
Open-weight workloads. If your stack is already built around Llama-class models, Groq lets you keep those models and simply run them faster.
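For the batch case, a bounded thread pool is one simple way to keep several requests in flight without flooding the API. This is a sketch rather than a tuned pipeline: run_batch is a hypothetical helper, and max_workers is a number you would set with your account's rate limits in mind.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(items, call, max_workers=4):
    """Apply `call` to every item with at most `max_workers` calls in flight.

    Results come back in the same order as `items`. An exception raised by
    any call propagates to the caller when the results are collected.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call, items))
```

Here call would be a small function that wraps client.chat.completions.create for one input and returns the text you need, for example a label or a summary.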

One honest caveat: speed claims are workload-dependent. Tokens-per-second and TTFT shift with prompt shape, output length, region, and load, so benchmark Groq from your own environment with your own prompts rather than trusting any headline number. For the broader playbook on where milliseconds hide, see our guide to reducing LLM latency.
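A rough benchmark does not need much code. The sketch below times any iterable of streamed text pieces and reports time-to-first-token plus a chunk rate. It counts streamed chunks as a proxy for tokens, which is an approximation; measure_stream and its return keys are hypothetical names.

```python
import time


def measure_stream(deltas, clock=time.perf_counter):
    """Consume an iterable of streamed text pieces and time it.

    Empty or None pieces are skipped. `clock` is injectable so the function
    can be tested without real waiting.
    """
    start = clock()
    first = None
    chunks = 0
    for delta in deltas:
        if not delta:
            continue
        if first is None:
            first = clock()  # moment the first visible text arrived
        chunks += 1
    end = clock()

    ttft = None if first is None else first - start
    gen_time = None if first is None else end - first
    rate = chunks / gen_time if gen_time else None
    return {
        "ttft_s": ttft,
        "total_s": end - start,
        "chunks": chunks,
        "chunks_per_s": rate,
    }
```

With the streaming example from earlier, the input could be the generator (c.choices[0].delta.content for c in stream if c.choices). Run it several times with your real prompts and compare medians, since a single run mostly measures noise.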

Groq API pricing and rate limits

Groq's commercial model is conventional per-token pricing: you pay separately for input and output tokens, and the rate varies by model — bigger models cost more per token than smaller ones. There's also typically a free tier that lets you build and test without a card, which is a big part of why Groq shows up so often in "free LLM API" roundups.
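The arithmetic behind a per-token bill is simple enough to sketch. The function below assumes prices are quoted per million tokens, which is a common convention but worth confirming on Groq's pricing page. The figures in the usage note are placeholders, not Groq's rates.

```python
def estimate_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Estimate spend when input and output tokens are billed separately.

    Prices are in currency units per one million tokens (an assumption;
    confirm the unit on the provider's pricing page).
    """
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost
```

For example, with placeholder prices of 0.10 per million input tokens and 0.40 per million output tokens, a job that reads 2,000,000 tokens and writes 500,000 costs 0.20 + 0.20 = 0.40. Output-heavy workloads are often dominated by the output rate, so estimate both sides.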

Both the free tier and the paid tiers come with rate limits, usually expressed as some combination of:

RPM — requests per minute.
TPM — tokens per minute.
RPD / TPD — daily request and token ceilings on some tiers.

Cross any of those and the API returns an HTTP 429, the standard "too many requests" signal. Limits differ per model and per account tier, and they change as Groq adjusts capacity — so the only reliable figures are on Groq's official pricing and rate-limits pages. Don't bake a specific RPM or TPM into your assumptions from a third-party source.
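To stay under those ceilings instead of discovering them through 429s, a client can track its own usage over a rolling minute. The sketch below is one simple way to do that. MinuteBudget is a hypothetical class, the rpm and tpm values must come from your own account's limits page, and the token counts passed in are your own estimates.

```python
import time
from collections import deque


class MinuteBudget:
    """Track requests and tokens sent in the last 60 seconds."""

    def __init__(self, rpm, tpm, clock=time.monotonic):
        self.rpm, self.tpm, self.clock = rpm, tpm, clock
        self.events = deque()  # (timestamp, tokens) per request sent

    def wait_time(self, tokens):
        """Seconds to wait before a request of `tokens` fits; 0.0 if it fits now.

        The result is when the oldest request leaves the window, so a caller
        may need to check again after sleeping.
        """
        if tokens > self.tpm:
            raise ValueError("a single request exceeds the per-minute token budget")
        now = self.clock()
        while self.events and now - self.events[0][0] >= 60.0:
            self.events.popleft()  # drop requests older than the window
        used = sum(t for _, t in self.events)
        if len(self.events) < self.rpm and used + tokens <= self.tpm:
            return 0.0
        return max(0.0, 60.0 - (now - self.events[0][0]))

    def record(self, tokens):
        """Note that a request using `tokens` was just sent."""
        self.events.append((self.clock(), tokens))
```

A caller would loop on wait_time, sleep for the returned duration, then send the request and call record. This only approximates the server's accounting, so it reduces 429s but does not replace handling them.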

The practical implication for anything beyond a demo: plan for the limit, not around it. Respect Retry-After when you get a 429, back off with jitter, and queue work so a burst doesn't slam the ceiling. If you want the full mechanics of handling these errors gracefully, we cover them in our guide to fixing LLM 429 rate limit errors.
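Those two rules, honor Retry-After and otherwise back off with jitter, can be sketched in a few lines. RateLimited below is a hypothetical stand-in for whatever exception your client raises on a 429 (with the openai package that would be its RateLimitError, whose response headers carry Retry-After). The base, cap, and attempt counts are illustrative defaults, not recommendations from Groq.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429; carries Retry-After if present."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited (HTTP 429)")
        self.retry_after = retry_after


def backoff_delay(attempt, retry_after=None, base=0.5, cap=30.0, rng=random.random):
    """Seconds to wait before retrying after failed attempt number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))  # the server's hint wins
        except (TypeError, ValueError):
            pass  # not a number of seconds (e.g. an HTTP date); use backoff
    # Exponential backoff with full jitter: random point in [0, capped window).
    return rng() * min(cap, base * 2 ** attempt)


def call_with_retry(fn, max_attempts=5, sleep=time.sleep, **delay_kwargs):
    """Call `fn` until it succeeds or `max_attempts` rate-limit errors occur."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller decide
            sleep(backoff_delay(attempt, err.retry_after, **delay_kwargs))
```

The jitter matters when many workers hit the limit at once: without it they all retry at the same moment and collide again.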

Staying reliable under Groq's rate limits

Here's the tension every Groq user eventually hits: it's fast and cheap, but its rate limits are real, and a traffic spike that pushes you past your TPM ceiling will trigger 429s exactly when you can least afford it. Hard-wiring your app to a single Groq endpoint makes that limit your app's limit.

The clean fix is to put Groq behind an LLM gateway with automatic fallback. Instead of calling Groq directly, your code calls one stable endpoint; the gateway routes to Groq for its speed and, the moment Groq returns a 429 (or slows past your tolerance), transparently retries the same request on another provider — another host of the same open-weight model, or a different model entirely. Your users never see the rate limit; they just see a response.

That pattern gives you Groq's throughput as the fast path and a safety net for the tail, without scattering provider-specific retry logic across your services — the gateway owns the routing rules, fallback order, and per-call accounting in one place.
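The core of that routing rule is small enough to sketch by hand. The code below is an illustration of ordered fallback, not how any particular gateway implements it. ProviderUnavailable and complete_with_fallback are hypothetical names, and each provider is represented by a plain callable.

```python
class ProviderUnavailable(Exception):
    """Hypothetical signal that a provider was rate-limited or too slow."""


def complete_with_fallback(providers, prompt):
    """Try each (name, call) pair in order and return (name, answer).

    `providers` is a list ordered from preferred to last resort. Only
    ProviderUnavailable triggers a fallback; other errors propagate.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderUnavailable as err:
            errors.append(f"{name}: {err}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

In practice the first callable would wrap a Groq request and raise ProviderUnavailable on a 429 or timeout, and the second would call another host. A gateway adds what this sketch leaves out: shared configuration, cost accounting, and one place to change the fallback order.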

flo2 is a developer-first, bring-your-own-key LLM gateway built for exactly this. You add your own Groq key (plus OpenAI, Anthropic, Gemini, Cerebras, Mistral, xAI, and more) and pay each provider directly with zero markup on tokens — flo2 doesn't take a per-token cut, which makes it a genuine zero-markup OpenRouter alternative. One OpenAI- and Anthropic-compatible key routes each request to the cheapest or fastest model and falls back automatically when a provider is rate-limited, so Groq's speed stays a feature instead of a single point of failure. It's free during Beta, so you can wire Groq in behind a fallback and start measuring today.
