2026-06-03 · flo2 blog

LLM Retries Done Right: Exponential Backoff & Jitter

Every LLM API call is a network request to a busy, shared, occasionally-overloaded remote service — so it will fail sometimes, and your code needs a plan for when it does. A good LLM retry strategy is the difference between a blip the user never notices and a 500 on your own endpoint. But "just retry it" is where most implementations go wrong: they retry errors they shouldn't, hammer a struggling provider with no backoff, loop forever with no cap, and quietly double-charge users on non-deterministic generations. This guide covers how to retry LLM calls correctly — which errors to retry, exponential backoff with jitter, honoring Retry-After, idempotency, and circuit breakers — and why most of it belongs one layer below your app code.

LLM retry rule #1: which errors to retry, which to fail fast

The first rule of a good LLM retry strategy is that not all errors are worth retrying. A retry only helps when the same request might succeed on a second attempt. If the payload is malformed or the key is invalid, retrying changes nothing — it just burns latency and, on a partially-billed call, money. So before you write any backoff logic, classify the failure.

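Retry these (transient)

HTTP 429, 500, 502, 503, and 529, plus network-level failures such as connection resets, timeouts, and DNS errors.

Do NOT retry these (permanent)

HTTP 400, 401, 403, 422, and other client errors where the same request would fail the same way again.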

The most common bug in homegrown retry code is a blanket except: retry that swallows a 400 and pounds the endpoint five times for a result that was never going to change. Classify first, retry second.
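
As a minimal sketch of "classify first, retry second", the function below sorts a status code into one of three outcomes before any backoff logic runs. The names `Verdict`, `TRANSIENT_STATUS`, and `classify_status` are hypothetical, and the set of transient codes is an assumption that should be checked against your provider's documentation.

```python
from enum import Enum


class Verdict(Enum):
    SUCCESS = "success"
    RETRY = "retry"
    FAIL_FAST = "fail_fast"


# Assumption: these are the transient statuses for your provider;
# adjust the set to match what it actually documents.
TRANSIENT_STATUS = frozenset({429, 500, 502, 503, 529})


def classify_status(status_code: int) -> Verdict:
    """Decide what to do with a response before any waiting happens."""
    if 200 <= status_code < 300:
        return Verdict.SUCCESS
    if status_code in TRANSIENT_STATUS:
        return Verdict.RETRY
    # Everything else (400, 401, 403, 422, ...) is treated as permanent.
    return Verdict.FAIL_FAST


for code in (200, 429, 400):
    print(code, classify_status(code).value)
```

Keeping the decision in one small function makes it hard for a blanket exception handler to retry a 400 by accident.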

Exponential backoff with jitter (and a hard attempt cap)

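Once you know an error is retryable, how you wait between attempts matters as much as whether you retry at all. Three principles cover it: grow the delay exponentially, add jitter so clients don't all retry at the same moment, and cap the number of attempts.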

Here's a compact Python implementation, with the requests library as its only dependency, that ties classification and backoff together. It retries only the transient statuses, honors Retry-After when present, uses exponential backoff with full jitter otherwise, and caps total attempts so a hard failure surfaces instead of looping:

import random
import time

import requests

RETRYABLE_STATUS = {429, 500, 502, 503, 529}


def call_llm_with_retry(url, headers, payload, *, max_retries=4, base=1.0, cap=30.0):
    """POST an LLM request with classified retries + exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            resp = requests.post(url, headers=headers, json=payload, timeout=60)
        except (requests.ConnectionError, requests.Timeout):
            # Network-level failure (reset, timeout, DNS): transient, retry it.
            if attempt == max_retries:
                raise
            time.sleep(_backoff(attempt, base, cap))
            continue

        if resp.status_code == 200:
            return resp.json()

        # Permanent errors (400/401/403/422/...) will never succeed on retry.
        if resp.status_code not in RETRYABLE_STATUS:
            resp.raise_for_status()

        if attempt == max_retries:
            resp.raise_for_status()  # out of attempts -> surface the error

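        # Honor the provider's own timing when it tells us when the window resets.
        # The header may be missing (None) or an HTTP-date rather than seconds;
        # in both cases float() raises and we fall back to computed backoff.
        retry_after = resp.headers.get("Retry-After")
        try:
            delay = max(0.0, float(retry_after)) + random.uniform(0, 0.5)
        except (TypeError, ValueError):
            delay = _backoff(attempt, base, cap)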

        time.sleep(delay)

    raise RuntimeError("unreachable")


def _backoff(attempt, base, cap):
    # Full jitter: a random delay in [0, min(cap, base * 2**attempt)].
    return random.uniform(0, min(cap, base * (2 ** attempt)))
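
To make the numbers concrete, the sketch below prints the upper bound of the wait for each attempt using the same defaults as above (base=1.0, cap=30.0). `backoff_ceiling` and `full_jitter` are hypothetical helpers that repeat the formula from `_backoff`, split so the deterministic part can be inspected on its own.

```python
import random


def backoff_ceiling(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Upper bound of the full-jitter delay for a zero-indexed attempt."""
    return min(cap, base * (2 ** attempt))


def full_jitter(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Random delay drawn uniformly from [0, ceiling]."""
    return random.uniform(0, backoff_ceiling(attempt, base, cap))


ceilings = [backoff_ceiling(a) for a in range(7)]
print(ceilings)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

With max_retries=4 the function sleeps after attempts 0 through 3, so when no Retry-After header is present the total time spent sleeping is at most 1 + 2 + 4 + 8 = 15 seconds, and about half that on average because of the jitter. Time spent inside each request, up to the 60-second timeout, comes on top.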

Always honor Retry-After

When a 429 (or sometimes a 503) carries a Retry-After header, that value is authoritative — the provider is telling you exactly when the window resets. Sleep for that duration (plus a touch of jitter so a fleet of clients doesn't wake in lockstep) rather than your own exponential curve, which is only ever a guess. Fall back to computed backoff only when the header is absent.
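
One detail worth handling: HTTP allows Retry-After to be either a number of seconds or an HTTP-date. The sketch below parses both forms using only the standard library. `parse_retry_after` is a hypothetical helper, and accepting fractional seconds is an assumption (the HTTP specification defines whole seconds only).

```python
import math
import time
from datetime import timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str], now: Optional[float] = None) -> Optional[float]:
    """Return seconds to wait, or None if the header is absent or unparseable."""
    if value is None:
        return None
    value = value.strip()

    # Form 1: a delay in seconds, e.g. "Retry-After: 20".
    try:
        seconds = float(value)
    except ValueError:
        seconds = None
    if seconds is not None and math.isfinite(seconds):
        return max(0.0, seconds)

    # Form 2: an HTTP-date, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        target = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if target.tzinfo is None:
        # Treat a date without a usable zone as UTC rather than local time.
        target = target.replace(tzinfo=timezone.utc)
    now = time.time() if now is None else now
    return max(0.0, target.timestamp() - now)


print(parse_retry_after("20"))          # 20.0
print(parse_retry_after("not a date"))  # None
```

A caller would fall back to computed backoff whenever this returns None, and add a little jitter to any value it does return.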

Idempotency: the LLM-specific catch most retry code misses

Here's the wrinkle that makes retrying LLM calls different from an ordinary API call. A timeout doesn't always mean the request failed; it can mean the response was lost after the model already ran. Retry it and the provider runs (and bills) the generation again. With most REST APIs you'd send an idempotency key so the server dedupes the duplicate; LLM chat-completion APIs largely don't offer one, which leaves two real problems: you can be charged twice for the same request, and because generation is non-deterministic, the retried call may return a different answer than the one that was lost.

A few practical defenses:

Circuit breakers: stop retrying a provider that's clearly down

Retries assume failures are independent blips. During a real outage they aren't — the provider is down, so every request burns its full retry budget before giving up, multiplying your latency and your load against a service that has nothing to give. A circuit breaker is the fix.

The pattern is simple: track failures per provider, and when they cross a threshold (say, the last N calls all failed), "open the circuit" — stop sending requests to that target for a cooldown window and fail fast (or, better, route elsewhere). After the cooldown, let one probe request through; if it succeeds, close the circuit and resume; if not, wait again. The payoffs:

Together the three layers form a complete picture: retry handles the one-off blip, backoff with jitter keeps retries from making things worse, and the circuit breaker handles the sustained failure no retry can fix.
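
The breaker described above can be sketched in a few dozen lines. This is a minimal, single-threaded illustration under stated assumptions, not a production implementation: the class name `CircuitBreaker`, its parameters, and the default threshold and cooldown are all hypothetical, and a real service would need locking and one breaker instance per provider.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Closed -> open after N consecutive failures -> one probe after cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self._clock = clock  # injectable so tests can control time
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None
        self._probe_in_flight = False

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        if self._clock() - self._opened_at < self.cooldown:
            return False  # open: fail fast (or route elsewhere)
        if self._probe_in_flight:
            return False  # cooldown elapsed, but a probe is already out
        self._probe_in_flight = True
        return True  # let exactly one probe request through

    def record_success(self) -> None:
        # A success (including a successful probe) closes the circuit.
        self._consecutive_failures = 0
        self._opened_at = None
        self._probe_in_flight = False

    def record_failure(self) -> None:
        self._consecutive_failures += 1
        failed_probe = self._probe_in_flight
        self._probe_in_flight = False
        if failed_probe or self._consecutive_failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or restart the cooldown


breaker = CircuitBreaker(failure_threshold=3, cooldown=10.0)
for _ in range(3):
    if breaker.allow_request():
        breaker.record_failure()  # pretend the provider call failed
print(breaker.allow_request())  # False: the circuit is open
```

The retry loop would call `allow_request()` before each attempt and report the outcome afterwards, so a provider that is down stops consuming the retry budget of every request.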

Push retries and fallback down to the gateway layer

Notice how much logic we just described: error classification, exponential backoff with jitter, Retry-After parsing, idempotency handling, per-provider circuit breakers with cooldowns, and the fallback routing they feed. None of it is your product — it's distributed-systems plumbing, and inside your app it means reimplementing and re-testing the whole retry-and-fallback dance in every service that calls a model.

This is exactly the job of an LLM gateway: a layer between your app and the providers that owns the retry policy as configuration and exposes one stable endpoint. Your code makes a single clean call; the gateway decides how many times to retry, how long to back off, when to trip a breaker, and which provider or key to fall back to. (For the broader picture, see what is an LLM gateway.) Pushing this down pays off:

flo2 is a developer-first LLM gateway built around exactly this. Bring your own provider keys (OpenAI, Anthropic, Gemini, Groq, Cerebras, DeepInfra, Mistral, xAI, OpenRouter) and route every request through one OpenAI- and Anthropic-compatible key: you get fallback chains that auto-retry on 429/5xx and skip permanent 4xx errors, cooldowns on unhealthy providers, response caching for repeated requests, and true per-call cost accounting — all with zero token markup, since you pay the providers directly. It's the zero-markup OpenRouter alternative, and it's free during Beta.
