2026-06-03 · flo2 blog

Fixing LLM 429 Rate Limit Errors: Backoff, Retries & Fallback

You shipped a feature, traffic picked up, and now your logs are full of HTTP 429. An LLM 429 rate limit error means the provider received your request, looked at how fast you're sending, and decided you're over quota, so it rejected the request instead of running the model. It's usually not a bug in your code and it's rarely a billing problem; it's a throttle. The good news is that 429s are the most predictable failure an LLM API throws at you, and a handful of well-known client behaviors (backoff with jitter, respecting Retry-After, a concurrency cap, and spreading load across keys) fix the overwhelming majority of them.

Why you get an LLM 429 rate limit in the first place

Every major provider (OpenAI, Anthropic, Google, and the rest) caps how much you can send in a rolling window. A 429 Too Many Requests means you crossed one of those ceilings. There are usually two separate limits running at once, and you can trip either one: requests per minute (RPM) and tokens per minute (TPM).

RPM and TPM are the two you'll meet first (some providers add tokens-per-day, concurrent-request, or batch quotas on top). A few more things shape when the wall appears:

Read the rate-limit response headers before you retry

Don't guess how long to wait — the provider usually tells you. Rate-limited responses typically carry headers describing the limit and when it resets. Names vary by provider, but the shapes are consistent:

The practical rule: if a 429 carries Retry-After, sleep for at least that long, plus a little jitter, rather than using your own backoff curve. The provider knows when the window resets, and your exponential schedule is just a guess. Fall back to computed backoff only when the header is absent.
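To make that rule concrete, here is a minimal sketch of turning a Retry-After value into a number of seconds. The HTTP specification allows the header to be either a delay in seconds or an HTTP-date, so the sketch handles both. The function name retry_after_seconds and its now parameter are assumptions made for illustration, not part of any provider SDK.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Return the wait in seconds for a Retry-After value, or None if unusable."""
    if value is None:
        return None
    value = value.strip()
    try:
        # Form 1: a delay in seconds, e.g. "20".
        return max(0.0, float(value))
    except ValueError:
        pass
    try:
        # Form 2: an HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # neither form: let the caller fall back to computed backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", so clamp at zero.
    return max(0.0, (when - now).total_seconds())
```

A caller would treat None as "header absent or unreadable" and use its own backoff schedule in that case.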

Handle LLM rate limits with exponential backoff and jitter

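The correct client behavior for a transient 429 is to wait and retry, but how you wait matters enormously. A fixed delay or, worse, an immediate retry just hammers an already-saturated endpoint. Two principles fix this: exponential backoff, which raises the wait ceiling after each failed attempt, and jitter, which randomizes each wait so that many clients don't retry in lockstep.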

Here's a compact implementation in Python; its only third-party dependency is requests. It retries only on 429 and a fixed set of 5xx-class statuses, respects Retry-After when present, uses exponential backoff with full jitter otherwise, and caps the number of attempts so a hard limit doesn't loop forever:

import random
import time

import requests

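# 429 plus transient server-side statuses; 529 is the "overloaded" status some providers return.
RETRYABLE = {429, 500, 502, 503, 504, 529}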


def call_with_backoff(url, headers, payload, max_retries=5, base=1.0, cap=30.0):
    """POST with exponential backoff + full jitter; honors Retry-After."""
    for attempt in range(max_retries + 1):
        resp = requests.post(url, headers=headers, json=payload, timeout=60)

        if resp.status_code == 200:
            return resp.json()

        # Client errors other than 429 will never succeed on retry — fail fast.
        if resp.status_code not in RETRYABLE:
            resp.raise_for_status()

        if attempt == max_retries:
            resp.raise_for_status()  # out of retries

        # Prefer the provider's own guidance when it tells us how long to wait.
        retry_after = resp.headers.get("retry-after")
        try:
            # Numeric Retry-After (seconds), clamped at 0, plus a little jitter.
            delay = max(0.0, float(retry_after)) + random.uniform(0, 0.5)
        except (TypeError, ValueError):
            # Header absent (None) or not a number, e.g. an HTTP-date: fall back to
            # exponential backoff with full jitter, sleeping in [0, min(cap, base * 2**attempt)].
            delay = random.uniform(0, min(cap, base * (2 ** attempt)))

        time.sleep(delay)

    raise RuntimeError("unreachable")

The two details people skip: fail fast on non-retryable errors (a 400 bad request or 401 bad key will fail identically on every retry — looping just wastes time and money), and always cap the attempts so a sustained rate limit surfaces as a real error instead of an infinite stall.

Stay under TPM with a concurrency cap and a request queue

Backoff is reactive — it cleans up after you've already been throttled. The better fix is to not exceed the limit in the first place. For sustained throughput, the single most effective control is a concurrency cap: a ceiling on how many requests are in flight at once, with everything else waiting in a queue.
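As a rough illustration of a concurrency cap, the sketch below gates calls with asyncio.Semaphore so that no more than limit of them run at once while the rest wait their turn. The helper name run_capped and the shape of jobs (zero-argument coroutine functions) are assumptions chosen for the example; a real client would wrap its own request coroutine.

```python
import asyncio


async def run_capped(jobs, limit):
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    gate = asyncio.Semaphore(limit)

    async def guarded(job):
        # Waits here while `limit` calls are already running; this is the queue.
        async with gate:
            return await job()

    # gather preserves input order in its results.
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Sizing limit is the judgment call: start low, watch for 429s, and raise it until they begin to appear.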

This works because TPM and RPM are really about rate, and rate is roughly concurrency divided by per-request duration: 8 requests in flight at about 3 seconds each is roughly 160 requests per minute. Cap the in-flight requests with a semaphore sized to your token budget and you replace a sawtooth of bursts and 429s with smooth, predictable throughput. A token-aware queue (estimate tokens per request, admit only while the running sum fits the window) does the same job more precisely.
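The token-aware variant can be sketched as a small rolling-window ledger. This is one possible shape under stated assumptions: the class name TokenWindow, the try_admit method, and the injectable clock are all hypothetical, and the token count passed in is the caller's own estimate, not a number reported by the provider.

```python
import time
from collections import deque


class TokenWindow:
    """Admit requests only while estimated tokens fit a rolling window."""

    def __init__(self, tokens_per_minute, window=60.0, clock=time.monotonic):
        self.budget = tokens_per_minute
        self.window = window
        self.clock = clock
        self.spent = deque()  # (timestamp, tokens) for each admitted request
        self.total = 0        # running sum of tokens still inside the window

    def try_admit(self, tokens):
        now = self.clock()
        # Drop entries that have aged out of the window.
        while self.spent and now - self.spent[0][0] >= self.window:
            self.total -= self.spent.popleft()[1]
        if self.total + tokens > self.budget:
            return False  # over budget: the caller should wait and try again
        self.spent.append((now, tokens))
        self.total += tokens
        return True
```

A worker loop would call try_admit before each request and sleep briefly when it returns False. The provider's own accounting may differ from this local estimate, so backoff is still needed as a safety net.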

Scaling past a single key: load balancing and fallback

Backoff and a concurrency cap get you the most out of one key. But a single key has a single quota, and once your real demand exceeds that quota, no amount of polite retrying creates more capacity — you're just queuing against a wall. At that point you need more headroom, which means more than one key or more than one provider.
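A minimal sketch of that idea, assuming a hypothetical send(target, payload) function that raises a hypothetical RateLimited exception when a key or provider answers 429: rotate the starting key on every call to spread load, and fall through to the next key only when the current one is throttled. None of these names come from a real SDK.

```python
class RateLimited(Exception):
    """Hypothetical: raised by `send` when a target answers 429."""


class KeyPool:
    """Round-robin across targets, falling back to the next one on a 429."""

    def __init__(self, targets):
        self.targets = list(targets)
        if not self.targets:
            raise ValueError("at least one target is required")
        self.start = 0

    def call(self, send, payload):
        n = len(self.targets)
        order = [self.targets[(self.start + i) % n] for i in range(n)]
        self.start = (self.start + 1) % n  # the next call begins at the next key
        last_error = None
        for target in order:
            try:
                return send(target, payload)
            except RateLimited as err:
                last_error = err  # this target is throttled; try the next one
        raise RuntimeError("every target is rate limited") from last_error
```

Only rate-limit errors trigger the fallback here; any other exception propagates immediately, for the same reason the retry loop fails fast on a 400 or 401.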

None of this is novel — it's the same load-balancing, failover, and caching plumbing every distributed system grows. The catch is that building it inside your application means juggling multiple keys, health-tracking providers, and re-testing the whole retry-and-fallback dance on every change, in every service that calls a model. For more on where that logic belongs, see what is an LLM gateway, and for the failover and racing mechanics in depth, LLM fallback and racing.

Stop fighting 429s in application code

Handling LLM rate limits well is a solved problem — backoff with jitter, respect Retry-After, cap concurrency, spread load across keys, fall back across providers, and cache what repeats. The question is just where that logic lives. A gateway sits between your app and the providers and owns all of it as configuration, so your code makes one call and the gateway decides which key and provider it actually lands on.

flo2 is a developer-first LLM gateway built around exactly this: bring your own provider keys (OpenAI, Anthropic, Gemini, Groq, Cerebras, DeepInfra, Mistral, xAI, OpenRouter), and route every request through one OpenAI- and Anthropic-compatible key. You get fallback chains that auto-retry on another provider or key when one returns a 429, load-balancing across keys to multiply your effective rate limit, response caching to cut request volume, and true cost accounting — all with zero token markup, since you pay the providers directly. It's the zero-markup OpenRouter alternative, and it's free during Beta.
