2026-06-03 · flo2 blog

Groq Rate Limits: RPM/TPM, 429s & How to Scale Past Them

The Groq API is fast, genuinely and impressively so, but Groq rate limits are the first wall every developer hits once a project leaves the prototype stage. Whether you're on the free tier watching 429s pile up in your logs or on a paid plan struggling to understand why a traffic spike caused failures, this guide explains how Groq rate limiting works, how to read the signals it sends, and how to scale past a single key's ceiling without redesigning your whole stack.

How Groq Rate Limiting Works

Groq enforces limits along several dimensions at once. Understanding the shape of the system, even without knowing the exact current numbers, helps you design around it correctly.

The three limit axes: RPM, TPM, and daily caps

Groq rate limits are typically expressed across three axes: requests per minute (RPM), tokens per minute (TPM), and daily caps (requests or tokens per day).

Cross any one of these and Groq returns a 429 Too Many Requests response. The limit you hit first depends on your workload: request-heavy pipelines (many small calls) tend to hit RPM first; prompt-heavy workloads (long system prompts, large context windows) tend to exhaust TPM first; sustained overnight batch jobs often find the daily ceiling.
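A rough calculation shows which per-minute axis binds first for a given workload. The sketch below is illustrative only: the function name is made up for this post, and the limits passed in are placeholders, not Groq's real figures for any tier or model.

```python
def binding_limit(rpm_limit, tpm_limit, tokens_per_request):
    """Return (max sustainable requests per minute, axis that caps it)."""
    # Requests per minute that the token budget alone would allow.
    by_tokens = tpm_limit / tokens_per_request
    if by_tokens < rpm_limit:
        return by_tokens, "TPM"
    return float(rpm_limit), "RPM"


# Placeholder limits (30 RPM, 6000 TPM), not real Groq numbers.
print(binding_limit(30, 6000, 100))    # many small calls -> (30.0, 'RPM')
print(binding_limit(30, 6000, 1500))   # long prompts     -> (4.0, 'TPM')
```

With these placeholder limits the break-even point is 200 tokens per request: below it the request count runs out first, above it the token budget does.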

Free tier vs. paid tier

Groq offers a free tier with real, usable capacity — it's a genuine way to build and test. But the free tier's limits are meaningfully tighter than paid plans, and daily caps are more aggressive. If you are on the free tier and hitting 429s regularly, that's expected behavior for a production-scale workload, not a bug. Moving to a paid plan loosens both the per-minute and the daily limits considerably.

The exact numbers for each tier are not reproduced here because they change as Groq adjusts capacity and pricing. Always check Groq's console and their current limits documentation for the figures that apply to your account and the model you're calling. A number from a blog post — including this one — can be stale by next quarter.

Per-model limits

Groq applies limits per model, not just per account. A faster or more popular model may have different headroom than a smaller one. Your account might have generous overall limits, but one specific model could be more constrained, especially if it's under heavy demand across the platform. When a particular model keeps returning 429s while others don't, the per-model ceiling is likely the cause. Check the limits page filtered by the model ID you're using.

Reading the 429 Response and Its Headers

When Groq rate-limits you, the HTTP response tells you more than just "stop." Reading it correctly lets you react precisely rather than guessing.

The response will have status code 429. The body typically contains a JSON error object with a message field that names the limit that was exceeded (RPM, TPM, or daily), which is useful for tuning. More important is the retry-after response header:

HTTP/1.1 429 Too Many Requests
retry-after: 12
x-ratelimit-limit-requests: 30
x-ratelimit-limit-tokens: 6000
x-ratelimit-remaining-requests: 0
x-ratelimit-remaining-tokens: 0
x-ratelimit-reset-requests: 2s
x-ratelimit-reset-tokens: 12s

{
  "error": {
    "message": "Rate limit reached for model `<model-id>` in organization `org_xxx`: Limit 30, Used 30, Requested 1.",
    "type": "tokens",
    "code": "rate_limit_exceeded"
  }
}

The x-ratelimit-* headers tell you your current limits, remaining budget, and when each dimension resets. When retry-after is present, use it — it is the authoritative wait time, computed server-side from exactly when your window expires. Do not guess a fixed sleep duration; honor the header.
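As a minimal sketch of reading these headers in code, the helper below picks a wait time from a 429 response. It assumes the header names shown in the sample above, a retry-after value in seconds, and reset values written as duration strings such as "12s" or "2m59.5s". The function names are hypothetical, and the duration format should be checked against what your account actually returns.

```python
import re

_DURATION = re.compile(r"(?:\d+(?:\.\d+)?(?:ms|h|m|s))+")
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}


def parse_duration(value):
    """Convert strings like '12s', '2m59.5s', or '250ms' to seconds."""
    if not _DURATION.fullmatch(value):
        raise ValueError(f"unrecognized duration: {value!r}")
    return sum(float(num) * _UNIT_SECONDS[unit] for num, unit in _PART.findall(value))


def seconds_until_retry(headers):
    """Pick a wait time from 429 headers (a dict with lowercase keys is assumed)."""
    if "retry-after" in headers:
        return float(headers["retry-after"])  # assumed to be a number of seconds
    waits = []
    for axis in ("requests", "tokens"):
        # Only an exhausted axis forces a wait; take the longest such reset.
        if headers.get(f"x-ratelimit-remaining-{axis}") == "0":
            reset = headers.get(f"x-ratelimit-reset-{axis}")
            if reset:
                waits.append(parse_duration(reset))
    return max(waits) if waits else None
```

When retry-after is missing, the helper falls back to the longest reset among the exhausted axes, and returns None if the headers give no usable hint.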

Tactics to Live Within Groq's Limits

Before you need more capacity, there are a handful of techniques that let you extract significantly more usable throughput from the limits you already have.

Exponential backoff with jitter

The reactive layer: when a 429 arrives, wait before retrying — and wait in a way that doesn't create a synchronized retry storm. Exponential backoff doubles the wait on each failure; jitter randomizes it so clients that all hit the limit at the same moment don't all hammer the endpoint again at the same moment. Always prefer the retry-after header when it's present.

import random
import time

import requests

def call_with_backoff(url, headers, payload, max_retries=5):
    """POST to url, retrying only on 429 with retry-after or jittered backoff."""
    base, cap = 1.0, 30.0   # seconds: first backoff step and maximum backoff
    for attempt in range(max_retries + 1):
        resp = requests.post(url, headers=headers, json=payload, timeout=60)
        if resp.ok:
            return resp.json()
        if resp.status_code != 429 or attempt == max_retries:
            resp.raise_for_status()   # non-retryable error or out of retries: fail fast
        try:
            # Prefer the server's wait time; a little jitter de-synchronizes clients.
            delay = float(resp.headers["retry-after"]) + random.uniform(0, 0.5)
        except (KeyError, ValueError):
            # Header missing or not a number of seconds: full-jitter exponential backoff.
            delay = random.uniform(0, min(cap, base * (2 ** attempt)))
        time.sleep(delay)

Concurrency limits and request pacing

Backoff is reactive. Pacing is proactive. Cap the number of in-flight requests with a semaphore or worker pool, and queue overflow rather than letting it fire simultaneously. Even a short fixed gap between request starts (say, 100ms) converts a burst that trips an RPM ceiling into a smooth stream that stays under it. For token-heavy workloads, estimate the tokens in a batch before sending and delay when you're close to the TPM ceiling — rough math here is much better than no math.
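One way to sketch both ideas, a cap on in-flight requests plus a fixed gap between request starts, is a small thread-based pacer. The class name and default values are assumptions for illustration; tune them against the limits on your own account.

```python
import threading
import time


class Pacer:
    """Caps in-flight calls and enforces a minimum gap between request starts."""

    def __init__(self, max_in_flight=4, min_interval=0.1):
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._lock = threading.Lock()
        self._min_interval = min_interval
        self._next_start = 0.0

    def run(self, fn, *args, **kwargs):
        with self._slots:  # blocks while max_in_flight calls are already active
            with self._lock:  # reserve the next start time
                start_at = max(time.monotonic(), self._next_start)
                self._next_start = start_at + self._min_interval
            delay = start_at - time.monotonic()
            if delay > 0:
                time.sleep(delay)
            return fn(*args, **kwargs)
```

A worker pool can then submit pacer.run(send_request, payload) for each job, where send_request is whatever function performs the API call. Bursts queue on the semaphore and the start-time reservation instead of hitting the endpoint all at once.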

Cache repeated prompts

The request that doesn't go to Groq doesn't count against your limits. Many production workloads repeat the same prompts more than you'd expect — evaluation runs, retries, shared system prompts, templated queries. A response cache in front of your Groq calls removes these from the quota entirely. See our guide on LLM response caching for implementation patterns.
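A minimal in-memory version of such a cache might look like the sketch below. The class and the call parameter are hypothetical: call stands for whatever function sends the request. Exact-match caching like this only makes sense when identical requests should yield identical answers, for example with temperature set to 0.

```python
import hashlib
import json


class ResponseCache:
    """Exact-match cache keyed on model, messages, and sampling parameters."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    @staticmethod
    def key(model, messages, **params):
        # sort_keys makes the key independent of dict ordering.
        blob = json.dumps(
            {"model": model, "messages": messages, "params": params},
            sort_keys=True,
        )
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def get_or_call(self, call, model, messages, **params):
        k = self.key(model, messages, **params)
        if k in self._store:
            self.hits += 1  # served locally: no request, no quota spent
            return self._store[k]
        result = call(model, messages, **params)
        self._store[k] = result
        return result
```

A production cache would usually add expiry and a size bound, and would live in a shared store rather than process memory.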

Tune request size

TPM limits respond directly to prompt size. Trim system prompts, route simple tasks to smaller models, and avoid passing unnecessary context. On the free tier especially, every token saved is real budget preserved. For a deeper treatment of all the ways to handle 429s across any LLM provider, see fixing LLM 429 errors.

Scaling Past a Single Key: Multiple Keys and Automatic Fallback

Every technique above helps you get more from one key. But the fundamental limit of a single key is still a single quota. When your workload grows past what one Groq account can serve — or when you need guaranteed availability, not just higher throughput — the answer is to stop depending on a single target.

Multiple Groq keys

If your load is split across multiple Groq accounts, each with its own limits, the effective ceiling becomes the sum of those limits. A simple round-robin across two keys, one per account, roughly doubles your RPM and TPM budget. The catch is orchestration: you need to track which key is currently throttled, route away from it, and route back once the window resets. Done in application code, this logic tends to get scattered across every service that calls the API.
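The bookkeeping described here can be sketched as a small key pool: round-robin over keys, skip any that are cooling down after a 429, and return to them once their window has passed. The class and method names are hypothetical, and the sketch is not thread-safe as written.

```python
import time


class KeyPool:
    """Round-robin over API keys, skipping any cooling down after a 429."""

    def __init__(self, keys, clock=time.monotonic):
        if not keys:
            raise ValueError("at least one key is required")
        self._keys = list(keys)
        self._available_at = {k: 0.0 for k in self._keys}
        self._cursor = 0
        self._clock = clock  # injectable so tests can control time

    def acquire(self):
        """Return the next usable key, or None if every key is throttled."""
        now = self._clock()
        for _ in range(len(self._keys)):
            key = self._keys[self._cursor]
            self._cursor = (self._cursor + 1) % len(self._keys)
            if self._available_at[key] <= now:
                return key
        return None

    def mark_throttled(self, key, retry_after):
        """Record that key returned a 429 and should rest for retry_after seconds."""
        self._available_at[key] = self._clock() + retry_after

    def seconds_until_available(self):
        """How long until at least one key is usable again."""
        return max(0.0, min(self._available_at.values()) - self._clock())
```

A caller would acquire a key, send the request, and on a 429 call mark_throttled with the retry-after value before acquiring again. When acquire returns None, sleeping for seconds_until_available avoids a busy loop.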

Automatic provider fallback through a gateway

The more durable pattern is to define a fallback chain: when Groq returns a 429, the request automatically reroutes to another provider — Cerebras, DeepInfra, Fireworks, or any other host that serves a compatible model — without failing and without your application code knowing it happened. From the caller's perspective, the request succeeded. The 429 was handled in the routing layer.

| Setup | What happens on a Groq 429 | Effective ceiling |
| --- | --- | --- |
| Single Groq key (direct) | Request fails; your retry logic fires | That key's RPM/TPM limit |
| Multiple Groq keys, manual rotation | You route to next key; fails if all are tapped | Sum of all keys' limits |
| Groq + fallback provider via gateway | Gateway reroutes to next healthy provider transparently | Groq limits + fallback provider's limits combined |

The fallback chain approach also handles the harder case: what if Groq is down, slow, or under maintenance? A fallback to another provider means Groq's operational state is no longer your availability risk. For the full design pattern, the Groq API guide covers how to structure Groq calls, and the broader LLM 429 error guide walks through fallback chain design in detail.
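To make the fallback chain concrete, here is a minimal application-level sketch. It is not how flo2 or any particular gateway implements routing. Each provider is a hypothetical callable you supply, which sends the request and raises RateLimited when it receives a 429.

```python
class RateLimited(Exception):
    """Raised by a provider callable when it receives HTTP 429."""


def call_with_fallback(providers, payload):
    """Try each (name, call) pair in order; move on when one is rate-limited."""
    skipped = []
    for name, call in providers:
        try:
            return name, call(payload)
        except RateLimited:
            skipped.append(name)  # this provider is throttled; try the next one
    raise RateLimited("all providers rate-limited: " + ", ".join(skipped))
```

The caller receives the response along with the name of the provider that served it, which is useful for logging and cost tracking. A fuller version would also treat timeouts and 5xx errors as reasons to fall through.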

Where flo2 Fits

flo2 is a developer-first LLM gateway built around exactly this problem. You bring your own provider keys — Groq, Cerebras, DeepInfra, OpenAI, Anthropic, Gemini, Mistral, and others — and register them once. flo2 exposes a single OpenAI-compatible endpoint; define a fallback chain and it routes each request to the first healthy, non-rate-limited provider in the chain. When Groq returns a 429, flo2 automatically retries on the next provider without surfacing the failure to your application. Because you bring your own keys, flo2 adds zero token markup — you pay each provider at their list price with no aggregator cut on top. It is free during Beta.

If you're ready to stop architecting around a single key's ceiling, start with flo2 and register your Groq key alongside a fallback provider in under five minutes.
