2026-06-03 · flo2 blog

Fixing LLM 429 Rate Limit Errors: Backoff, Retries & Fallback

You shipped a feature, traffic picked up, and now your logs are full of HTTP 429. An LLM 429 rate limit error means the provider received your request, looked at how fast you're sending, and decided you're over quota, so it rejected you instead of running the model. It's usually not a bug in your code and it's rarely a billing problem (though some providers also return 429 when your credits or monthly budget run out, so check the error body); it's a throttle. The good news is that 429s are the most predictable failure an LLM API throws at you, and a handful of well-known client behaviors (backoff with jitter, respecting Retry-After, a concurrency cap, and spreading load across keys) fix most of them.

Why you get an LLM 429 rate limit in the first place

Every major provider (OpenAI, Anthropic, Google, and the rest) caps how much you can send in a rolling window. A 429 Too Many Requests means you crossed one of those ceilings. There are usually two separate limits running at once, and you can trip either one:

RPM — requests per minute. A cap on how many calls you make, regardless of size. Fire 200 tiny requests in a second and you'll hit RPM even though you've barely used any tokens.
TPM — tokens per minute. A cap on total tokens (prompt + completion) flowing through in the window. A handful of long-context requests with big prompts or large max_tokens can blow past TPM while your RPM looks fine.

RPM and TPM are the two you'll meet first (some providers add tokens-per-day, concurrent-request, or batch quotas on top). A few more things shape when the wall appears:

Burst vs. sustained. Per-minute limits are often enforced over much shorter intervals (a 60 RPM quota may be applied as roughly one request per second), so a brief spike can trip one even when your one-minute average is comfortably under it. Ten requests bunched into the same 200 ms is a burst; the limiter reacts to that short-term rate, not your tidy per-minute math.
Org, project, and tier. Quotas attach to an account, are often scoped per organization or project, and scale with usage tier. A fresh key on the lowest tier has far less headroom than an aged, high-spend account — which is why code that's fine in production 429s instantly from a sandbox key.
Per-model limits. Limits are usually set per model, and the newest or largest model often has the tightest quota — so moving from a mini model to a frontier one can put you over a ceiling you never noticed.
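
A quick back-of-the-envelope calculation shows which ceiling binds for a given workload. The sketch below is illustrative only: the function name binding_limit and the quota numbers are made up for the example and are not any provider's real limits.

```python
def binding_limit(rpm_limit, tpm_limit, tokens_per_request):
    """Return (max sustainable requests per minute, which limit binds).

    tokens_per_request is prompt + completion tokens for a typical call.
    """
    by_tokens = tpm_limit / tokens_per_request  # requests/min the token budget allows
    if by_tokens < rpm_limit:
        return by_tokens, "TPM"
    return float(rpm_limit), "RPM"


# Hypothetical quota: 500 RPM and 200,000 TPM.
small = binding_limit(500, 200_000, tokens_per_request=100)    # short prompts
large = binding_limit(500, 200_000, tokens_per_request=8_000)  # long-context calls
print(small)  # (500.0, 'RPM')
print(large)  # (25.0, 'TPM')
```

With short prompts the request count runs out first; with long-context calls the same quota supports only about 25 requests per minute before TPM becomes the wall.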

Read the rate-limit response headers before you retry

Don't guess how long to wait — the provider usually tells you. Rate-limited responses typically carry headers describing the limit and when it resets. Names vary by provider, but the shapes are consistent:

Retry-After — how long to wait before trying again, in seconds (or sometimes an HTTP date). When this is present on a 429, it is the authoritative answer. Honor it.
x-ratelimit-limit-* — your ceiling for a given dimension (requests or tokens).
x-ratelimit-remaining-* — how much of that budget is left in the current window. Watch this trend toward zero and you can slow down before you get rejected.
x-ratelimit-reset-* — when the window refills, as a duration or timestamp.

The practical rule: if a 429 carries Retry-After, sleep for at least that long, plus a little jitter, rather than using your own backoff curve. The provider knows when the window resets; your exponential schedule is just a guess. Fall back to computed backoff only when the header is absent or unparseable.
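
The remaining-budget headers can also drive proactive slowing, before any 429 arrives. Here is a minimal sketch that assumes the x-ratelimit-limit-requests / x-ratelimit-remaining-tokens style of header names; other providers use different prefixes, so treat the names, the helper functions, and the 10% threshold as assumptions to adapt.

```python
def budget_from_headers(headers, dimension="requests"):
    """Return the fraction of the window's budget still available, or None.

    Assumes x-ratelimit-limit-* / x-ratelimit-remaining-* header names;
    adjust the prefix for your provider.
    """
    lowered = {k.lower(): v for k, v in headers.items()}
    limit = lowered.get(f"x-ratelimit-limit-{dimension}")
    remaining = lowered.get(f"x-ratelimit-remaining-{dimension}")
    try:
        limit, remaining = float(limit), float(remaining)
    except (TypeError, ValueError):
        return None  # header missing or not numeric
    if limit <= 0:
        return None
    return remaining / limit


def should_slow_down(headers, threshold=0.1):
    """True when requests or tokens has less than `threshold` of its budget left."""
    fractions = [budget_from_headers(headers, d) for d in ("requests", "tokens")]
    return any(f is not None and f < threshold for f in fractions)


sample = {
    "x-ratelimit-limit-requests": "500",
    "x-ratelimit-remaining-requests": "450",
    "x-ratelimit-limit-tokens": "200000",
    "x-ratelimit-remaining-tokens": "9000",
}
print(budget_from_headers(sample, "tokens"))  # 0.045
print(should_slow_down(sample))               # True
```

In this sample the request budget is nearly untouched, but only 4.5% of the token budget remains, so the client should pause or shrink its next requests.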

Handle LLM rate limits with exponential backoff and jitter

The correct client behavior for a transient 429 is to wait and retry — but how you wait matters enormously. A fixed delay or, worse, an immediate retry just hammers an already-saturated endpoint. Two principles fix this:

Exponential backoff. Double the wait after each failed attempt (1s, 2s, 4s, 8s…) so you back off fast when the provider is clearly busy.
Jitter. Add randomness to each delay. Without it, every client that hit the limit at the same moment retries at the same moment — a synchronized "thundering herd" that re-saturates the provider the instant it recovers. Jitter spreads those retries out.

Here's a clean implementation in Python whose only dependency is requests. It retries only on 429 and a short list of transient 5xx statuses, respects Retry-After when present, uses exponential backoff with full jitter otherwise, and caps the number of attempts so a hard limit doesn't loop forever:

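import random
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

import requests

# 429 plus transient server-side statuses (some providers use 529 for "overloaded").
RETRYABLE = {429, 500, 502, 503, 529}


def parse_retry_after(value):
    """Return Retry-After as seconds, or None if it is missing or unparseable."""
    if not value:
        return None
    try:
        return max(0.0, float(value))  # delta-seconds form, e.g. "20"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())


def call_with_backoff(url, headers, payload, max_retries=5, base=1.0, cap=30.0):
    """POST with exponential backoff + full jitter; honors Retry-After."""
    for attempt in range(max_retries + 1):
        resp = requests.post(url, headers=headers, json=payload, timeout=60)

        if resp.ok:
            return resp.json()

        # Client errors other than 429 will never succeed on retry, so fail fast.
        # Also surface the error once the retry budget is spent.
        if resp.status_code not in RETRYABLE or attempt == max_retries:
            resp.raise_for_status()

        # Prefer the provider's own guidance when it tells us how long to wait.
        wait = parse_retry_after(resp.headers.get("retry-after"))
        if wait is not None:
            delay = wait + random.uniform(0, 0.5)
        else:
            # Exponential backoff with full jitter: sleep in [0, min(cap, base * 2**attempt)].
            delay = random.uniform(0, min(cap, base * (2 ** attempt)))

        time.sleep(delay)

    raise RuntimeError("unreachable")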

The two details people skip: fail fast on non-retryable errors (a 400 bad request or 401 bad key will fail identically on every retry — looping just wastes time and money), and always cap the attempts so a sustained rate limit surfaces as a real error instead of an infinite stall.

Stay under TPM with a concurrency cap and a request queue

Backoff is reactive — it cleans up after you've already been throttled. The better fix is to not exceed the limit in the first place. For sustained throughput, the single most effective control is a concurrency cap: a ceiling on how many requests are in flight at once, with everything else waiting in a queue.

This works because TPM and RPM are really about rate, and rate is roughly concurrency divided by per-request duration: eight requests in flight at about two seconds each is about four requests per second, or 240 per minute. Cap the in-flight requests with a semaphore sized to your token budget and you replace a sawtooth of bursts and 429s with smooth, predictable throughput. A token-aware queue (estimate tokens per request, admit only while the running sum fits the window) does the same job more precisely.
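
A token-aware admission check can be sketched as a sliding window over recent admissions. This is an illustration under assumptions, not a provider's actual algorithm: the class name TokenWindow, the 60-second window, and the idea that you can estimate tokens per request up front are all choices made for the example.

```python
import time
from collections import deque


class TokenWindow:
    """Sliding-window token budget: admit a request only if it fits the window."""

    def __init__(self, tpm_limit, window_seconds=60.0, clock=time.monotonic):
        self.tpm_limit = tpm_limit
        self.window = window_seconds
        self.clock = clock
        self.events = deque()  # (timestamp, tokens) for admitted requests
        self.used = 0

    def _expire(self, now):
        # Drop admissions that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            _, tokens = self.events.popleft()
            self.used -= tokens

    def try_admit(self, estimated_tokens):
        """Reserve the tokens and return True if they fit; otherwise return False."""
        now = self.clock()
        self._expire(now)
        if self.used + estimated_tokens > self.tpm_limit:
            return False
        self.events.append((now, estimated_tokens))
        self.used += estimated_tokens
        return True


# Demonstration with a fake clock so the example runs instantly.
fake_now = [0.0]
window = TokenWindow(tpm_limit=10_000, clock=lambda: fake_now[0])
print(window.try_admit(6_000))  # True
print(window.try_admit(6_000))  # False: 12,000 would exceed the budget
fake_now[0] = 61.0              # the first admission has aged out
print(window.try_admit(6_000))  # True
```

A caller that gets False would leave the request in its queue and try again shortly, which is the "admit only while the running sum fits" behavior described above.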

Cap concurrency with a semaphore or worker pool sized to your TPM, not to how many tasks you happen to have.
Queue the overflow rather than firing everything at once; a short wait beats a rejected request and a retry.
Smooth bursts by pacing — even a small fixed gap between request starts turns a spike into a stream the limiter is happy with.
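
The three points above can be sketched with asyncio. The helper name run_capped, the max_in_flight and min_gap values, and the fake_llm_call stand-in are assumptions for illustration; in real code the job would be your provider call and the cap would come from your own TPM budget.

```python
import asyncio


async def run_capped(jobs, max_in_flight=4, min_gap=0.0):
    """Run coroutine factories with at most `max_in_flight` running at once.

    `jobs` is a list of zero-argument callables that each return a coroutine.
    `min_gap` is an optional pause, in seconds, between scheduling each job.
    """
    semaphore = asyncio.Semaphore(max_in_flight)

    async def worker(job):
        async with semaphore:  # overflow waits here instead of hitting the API
            return await job()

    tasks = []
    for job in jobs:
        tasks.append(asyncio.create_task(worker(job)))
        if min_gap:
            await asyncio.sleep(min_gap)  # pace starts so bursts become a stream
    return await asyncio.gather(*tasks)


# Demonstration with a stand-in for a model call that records peak concurrency.
state = {"in_flight": 0, "peak": 0}


async def fake_llm_call(i):
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    await asyncio.sleep(0.01)  # pretend the provider is working
    state["in_flight"] -= 1
    return i * 2


results = asyncio.run(
    run_capped([lambda i=i: fake_llm_call(i) for i in range(10)], max_in_flight=3)
)
print(results)        # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
print(state["peak"])  # 3
```

Ten jobs are submitted at once, but no more than three are ever in flight; the rest wait on the semaphore, which acts as the queue.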

Scaling past a single key: load balancing and fallback

Backoff and a concurrency cap get you the most out of one key. But a single key has a single quota, and once your real demand exceeds that quota, no amount of polite retrying creates more capacity — you're just queuing against a wall. At that point you need more headroom, which means more than one key or more than one provider.

Load-balance across keys. Several keys — across projects, accounts, or providers serving the same model — each carry their own RPM/TPM budget. Distribute requests across them and your effective ceiling is the sum, not any single key's limit.
Fall back automatically on 429. When one key or provider returns a rate limit that won't clear, the request should move to the next healthy target — a different key, or a different provider's equivalent model — instead of failing. A 429 from OpenAI doesn't mean Anthropic or Gemini is busy, so failover across providers turns a hard rate-limit wall into a soft, transparent reroute.
Cache identical requests. The cheapest 429 fix is the request you never send. When the exact same request repeats (same model, prompt, and parameters), as it often does with re-run or idempotent jobs, a response cache serves it without touching the provider, cutting request and token volume directly and leaving more quota for traffic that actually needs the model.
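
As a rough sketch of how these three pieces fit together in application code: the TargetPool class, the RateLimited exception, the 30-second cooldown, and the send callable are all hypothetical names introduced for this example, and a target here is just a label for a key or provider.

```python
import hashlib
import json
import time


class RateLimited(Exception):
    """Raised by the hypothetical `send` callable when a target returns 429."""


class TargetPool:
    """Round-robin over targets, skipping any that are cooling down after a 429."""

    def __init__(self, targets, cooldown=30.0, clock=time.monotonic):
        self.targets = list(targets)
        self.cooldown = cooldown
        self.clock = clock
        self.blocked_until = {t: 0.0 for t in self.targets}
        self.next_index = 0

    def call(self, send, payload):
        """Try each healthy target once, starting from the round-robin position."""
        now = self.clock()
        n = len(self.targets)
        for offset in range(n):
            target = self.targets[(self.next_index + offset) % n]
            if self.blocked_until[target] > now:
                continue  # still cooling down from an earlier 429
            try:
                result = send(target, payload)
            except RateLimited:
                self.blocked_until[target] = now + self.cooldown
                continue  # fall back to the next target
            self.next_index = (self.next_index + offset + 1) % n
            return result
        raise RateLimited("every target is rate limited")


def cache_key(payload):
    """Stable key for an exact-match response cache."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Demonstration: "key-a" is throttled, so its requests reroute to "key-b".
def fake_send(target, payload):
    if target == "key-a":
        raise RateLimited(target)
    return f"{target} handled {payload['prompt']}"


pool = TargetPool(["key-a", "key-b"])
cache = {}
payload = {"model": "example-model", "prompt": "hello"}
key = cache_key(payload)
if key not in cache:
    cache[key] = pool.call(fake_send, payload)
print(cache[key])  # key-b handled hello
```

Even this toy version has to track cooldowns, rotation state, and cache keys, and a production version would also need shared state across processes and per-provider request translation.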

None of this is novel: it's the same load-balancing, failover, and caching plumbing every distributed system grows. The catch is that building it inside your application means juggling multiple keys, health-tracking providers, and re-testing the whole retry-and-fallback dance on every change, in every service that calls a model. For more on where that logic belongs, see "What is an LLM gateway", and for the failover and racing mechanics in depth, see "LLM fallback and racing".

Stop fighting 429s in application code

Handling LLM rate limits well is a solved problem — backoff with jitter, respect Retry-After, cap concurrency, spread load across keys, fall back across providers, and cache what repeats. The question is just where that logic lives. A gateway sits between your app and the providers and owns all of it as configuration, so your code makes one call and the gateway decides which key and provider it actually lands on.

flo2 is a developer-first LLM gateway built around exactly this: bring your own provider keys (OpenAI, Anthropic, Gemini, Groq, Cerebras, DeepInfra, Mistral, xAI, OpenRouter), and route every request through one OpenAI- and Anthropic-compatible key. You get fallback chains that auto-retry on another provider or key when one returns a 429, load-balancing across keys to multiply your effective rate limit, response caching to cut request volume, and true cost accounting — all with zero token markup, since you pay the providers directly. It's the zero-markup OpenRouter alternative, and it's free during Beta.
