LLM Retries Done Right: Exponential Backoff & Jitter
Every LLM API call is a network request to a busy, shared, occasionally-overloaded remote service — so it will fail sometimes, and your code needs a plan for when it does. A good LLM retry strategy is the difference between a blip the user never notices and a 500 on your own endpoint. But "just retry it" is where most implementations go wrong: they retry errors they shouldn't, hammer a struggling provider with no backoff, loop forever with no cap, and quietly double-charge users on non-deterministic generations. This guide covers how to retry LLM calls correctly — which errors to retry, exponential backoff with jitter, honoring Retry-After, idempotency, and circuit breakers — and why most of it belongs one layer below your app code.
LLM retry rule #1: which errors to retry, which to fail fast
The first rule of a good LLM retry strategy is that not all errors are worth retrying. A retry only helps when the same request might succeed on a second attempt. If the payload is malformed or the key is invalid, retrying changes nothing — it just burns latency and, on a partially-billed call, money. So before you write any backoff logic, classify the failure.
Retry these (transient)
- 429 Too Many Requests: a rate limit. The window resets shortly; back off and it clears. (For the full RPM/TPM breakdown, see fixing LLM 429 errors.)
- 500, 502, 503, 529: server-side errors such as an overloaded backend, a bad deploy, or a capacity crunch on a newly popular model. Transient, and they cluster.
- Timeouts: the request took too long or never came back. A fresh attempt can land on a healthy node instead of the degraded one.
- Connection resets / network errors: ECONNRESET, dropped TLS handshakes, DNS hiccups. Pure infrastructure noise, almost always worth a retry.
Do NOT retry these (permanent)
- 400 Bad Request / 422 Unprocessable Entity: a malformed payload or violated constraint (bad parameter, context-length overflow). It fails identically every time; fix the request, don't repeat it.
- 401 Unauthorized / 403 Forbidden: a missing, wrong, or revoked key, or a permission you lack. Retrying a bad key just loops the same rejection.
- Content-policy refusals: the provider declined on safety grounds. A retry yields the same refusal; this needs a prompt change, not another attempt.
The most common bug in homegrown retry code is a blanket except: retry that swallows a 400 and pounds the endpoint five times for a result that was never going to change. Classify first, retry second.
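As a rough illustration of "classify first", the two lists above can be folded into a pair of small helpers. This is a minimal sketch using only the standard library; the names Action, classify_status, and classify_exception are hypothetical, and the status set simply mirrors the transient list in this article rather than any provider's official guidance.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    FAIL_FAST = "fail_fast"


# Statuses this article treats as transient (assumption: adjust per provider).
TRANSIENT_STATUS = {429, 500, 502, 503, 529}


def classify_status(status_code: int) -> Action:
    """Map an HTTP status code to a retry decision."""
    if status_code in TRANSIENT_STATUS:
        return Action.RETRY
    # 400/401/403/422 and anything else not listed as transient.
    return Action.FAIL_FAST


def classify_exception(exc: BaseException) -> Action:
    """Treat built-in connection and timeout errors as transient."""
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return Action.RETRY
    return Action.FAIL_FAST
```

A retry loop would consult one of these before sleeping, so a 400 surfaces immediately instead of being retried by a blanket except.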
Exponential backoff with jitter (and a hard attempt cap)
Once you know an error is retryable, how you wait between attempts matters as much as whether you retry at all. Three principles cover it:
- Exponential backoff. Double the delay after each failure — 1s, 2s, 4s, 8s — so you back off fast when a provider is clearly struggling instead of adding to its load.
- Jitter. Add randomness to each delay. Without it, every client that failed at the same instant retries at the same instant: a synchronized "thundering herd" that re-saturates the provider the moment it recovers. Full jitter (a random delay in [0, backoff]) spreads those retries out.
- A max-attempts cap. Always bound the loop. Uncapped retries against a sustained outage heal nothing; they just turn a fast, visible error into a long, silent stall that ties up a worker. Three to five attempts is plenty.
Here's a compact Python implementation that ties classification and backoff together; its only third-party dependency is the requests library. It retries only the transient statuses, honors Retry-After when present, uses exponential backoff with full jitter otherwise, and caps total attempts so a hard failure surfaces instead of looping:
import random
import time

import requests

# Statuses worth retrying: rate limits and transient server-side errors.
RETRYABLE_STATUS = {429, 500, 502, 503, 529}


def call_llm_with_retry(url, headers, payload, *, max_retries=4, base=1.0, cap=30.0):
    """POST an LLM request with classified retries + exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            resp = requests.post(url, headers=headers, json=payload, timeout=60)
        except (requests.ConnectionError, requests.Timeout):
            # Network-level failure (reset, timeout, DNS): transient, retry it.
            if attempt == max_retries:
                raise
            time.sleep(_backoff(attempt, base, cap))
            continue

        # Any status below 400 counts as success, not only a literal 200.
        if resp.ok:
            return resp.json()

        # Permanent errors (400/401/403/422/...) will never succeed on retry.
        if resp.status_code not in RETRYABLE_STATUS:
            resp.raise_for_status()
        if attempt == max_retries:
            resp.raise_for_status()  # out of attempts -> surface the error

        # Honor the provider's own timing when it tells us when the window resets.
        retry_after = resp.headers.get("Retry-After")
        try:
            # Only the seconds form is read here; a missing header or an
            # HTTP-date value falls back to computed backoff.
            delay = max(0.0, float(retry_after)) + random.uniform(0, 0.5)
        except (TypeError, ValueError):
            delay = _backoff(attempt, base, cap)
        time.sleep(delay)

    raise RuntimeError("unreachable")


def _backoff(attempt, base, cap):
    # Full jitter: a random delay in [0, min(cap, base * 2**attempt)].
    return random.uniform(0, min(cap, base * (2 ** attempt)))
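To make the schedule concrete, here is a small sketch that prints the delay ceilings produced by the defaults used above (base=1.0, cap=30.0). The helper name backoff_ceiling is made up for this illustration; it computes only the upper bound, and the actual sleep is a random value between 0 and that bound.

```python
def backoff_ceiling(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Upper bound of the full-jitter delay for a zero-indexed attempt."""
    return min(cap, base * (2 ** attempt))


ceilings = [backoff_ceiling(a) for a in range(6)]
print(ceilings)           # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]

# With max_retries=4 there are at most four sleeps (after attempts 0..3).
print(sum(ceilings[:4]))  # 15.0
```

So with the defaults, computed backoff adds at most 15 seconds of sleeping across all retries, and about half that on average because full jitter draws uniformly. Time spent waiting on the requests themselves, and any Retry-After value the provider sends, comes on top.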
Always honor Retry-After
When a 429 (or sometimes a 503) carries a Retry-After header, that value is authoritative — the provider is telling you exactly when the window resets. Sleep for that duration (plus a touch of jitter so a fleet of clients doesn't wake in lockstep) rather than your own exponential curve, which is only ever a guess. Fall back to computed backoff only when the header is absent.
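One detail worth handling: the HTTP specification allows Retry-After to be either a number of seconds or an HTTP-date. The sketch below parses both forms with the standard library and returns None when the value is missing or unusable, so the caller can fall back to computed backoff. The function name parse_retry_after and its now parameter (there to make it testable) are assumptions for this example.

```python
import math
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Return the wait in seconds from a Retry-After header, or None if unusable."""
    if not value:
        return None
    value = value.strip()

    # Form 1: delay in seconds, e.g. "20".
    try:
        seconds = float(value)
    except ValueError:
        seconds = None
    if seconds is not None:
        return max(0.0, seconds) if math.isfinite(seconds) else None

    # Form 2: HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A caller might use it as `delay = parse_retry_after(resp.headers.get("Retry-After"))`, add a little jitter when the result is not None, and use exponential backoff otherwise.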
Idempotency: the LLM-specific catch most retry code misses
Here's the wrinkle that makes retrying LLM calls different from an ordinary API call. A timeout doesn't always mean the request failed; it can mean the response was lost after the model already ran. Retry it and the provider runs (and bills) the generation again. With many REST APIs (payment APIs are the classic case) you'd send an idempotency key so the server dedupes the duplicate; LLM chat-completion APIs largely don't offer one, which leaves two real problems:
- Double billing. A retried call after a silently-successful first attempt means you paid for two completions and used one — and over a high-traffic path that adds up.
- Non-deterministic output. Generation is stochastic, so the retry returns a different answer than the lost original. If anything downstream caches by request, expects a stable result, or already showed partial output, a divergent retry can corrupt state.
A few practical defenses:
- Key your own work, not just the API call. Wrap the operation in an application-level idempotency key (a hash of the user action + input) and store the first successful result, so a retried job reuses the original output instead of generating fresh.
- Pin the output shape. Use JSON mode or a tool schema so even a divergent retry stays parseable; for extraction or classification, a low temperature keeps the retry's output close to the original.
- Be careful with streaming. Once tokens are streamed to the client you can't un-send them; if a streamed call fails mid-response you usually can't transparently retry. Retry only before the first token, or build explicit replace-the-stream handling.
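The first defense, keying your own work, might look roughly like this. It is a sketch under stated assumptions: the in-memory dict stands in for a shared store such as Redis or a database table, the names idempotency_key and run_once are hypothetical, and the code is not safe under concurrent writers. It also cannot stop the provider from billing a call whose response was lost; it only ensures that once a result has been stored, a retried job reuses it instead of generating a fresh, different one.

```python
import hashlib
import json
from typing import Callable, Dict

# Stand-in for a shared store (assumption: swap for Redis, a DB table, etc.).
_results: Dict[str, str] = {}


def idempotency_key(user_id: str, action: str, payload: dict) -> str:
    """Stable hash of who did what with which input."""
    canonical = json.dumps(
        {"user": user_id, "action": action, "payload": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_once(key: str, generate: Callable[[], str]) -> str:
    """Return the stored result for key, generating it only on first success."""
    if key in _results:
        return _results[key]
    result = generate()  # may raise; nothing is stored on failure
    _results[key] = result
    return result
```

Here generate would be a closure around the retrying LLM call, so a job that is re-queued after a crash or a duplicate click gets the original output back.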
Circuit breakers: stop retrying a provider that's clearly down
Retries assume failures are independent blips. During a real outage they aren't — the provider is down, so every request burns its full retry budget before giving up, multiplying your latency and your load against a service that has nothing to give. A circuit breaker is the fix.
The pattern is simple: track failures per provider, and when they cross a threshold (say, the last N calls all failed), "open the circuit" — stop sending requests to that target for a cooldown window and fail fast (or, better, route elsewhere). After the cooldown, let one probe request through; if it succeeds, close the circuit and resume; if not, wait again. The payoffs:
- You stop hammering a dead endpoint, sparing the provider and giving it room to recover.
- Failures get fast — an open circuit rejects in microseconds instead of dragging every request through 30 seconds of doomed backoff.
- It pairs with fallback. An open circuit on your primary is the trigger to route to a secondary provider, turning an outage into a transparent reroute.
Together the three layers form a complete picture: retry handles the one-off blip, backoff with jitter keeps retries from making things worse, and the circuit breaker handles the sustained failure no retry can fix.
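For reference, the breaker pattern described above fits in a few dozen lines. This is a simplified, single-threaded sketch: the class name, the consecutive-failure threshold, and the injectable clock are choices made for illustration, and a production breaker would add locking and per-provider instances.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after `threshold` consecutive failures; probe after `cooldown` seconds."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None
        self._probing = False

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True                      # closed: traffic flows
        if self._clock() - self._opened_at < self.cooldown:
            return False                     # open: fail fast
        if self._probing:
            return False                     # a probe is already in flight
        self._probing = True                 # cooldown over: let one probe through
        return True

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None               # close the circuit
        self._probing = False

    def record_failure(self) -> None:
        self._failures += 1
        if self._probing or self._failures >= self.threshold:
            self._opened_at = self._clock()  # open, or restart the cooldown
            self._probing = False
```

Callers check allow_request() before each call, then report the outcome with record_success() or record_failure(). When allow_request() returns False, that is the cue to fail fast or route to a secondary provider.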
Push retries and fallback down to the gateway layer
Notice how much logic we just described: error classification, exponential backoff with jitter, Retry-After parsing, idempotency handling, per-provider circuit breakers with cooldowns, and the fallback routing they feed. None of it is your product — it's distributed-systems plumbing, and inside your app it means reimplementing and re-testing the whole retry-and-fallback dance in every service that calls a model.
This is exactly the job of an LLM gateway: a layer between your app and the providers that owns the retry policy as configuration and exposes one stable endpoint. Your code makes a single clean call; the gateway decides how many times to retry, how long to back off, when to trip a breaker, and which provider or key to fall back to. (For the broader picture, see what is an LLM gateway.) Pushing this down pays off:
- Your application code stays trivial — one request, no retry loop, no backoff math, no breaker state.
- Policy is centralized. Tune backoff, attempt caps, or fallback order once and every service inherits it — no fleet-wide redeploys.
- Retries compose with real failover. When a target exhausts its retries or its circuit opens, the gateway reroutes to a different provider or key instead of erroring, which a retry loop pinned to a single provider can't do. It also logs every attempt, so you can see how often your primary fails and what your true blended cost is.
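To give a feel for the rerouting a gateway performs on your behalf, here is a rough sketch of a fallback chain. Everything in it is hypothetical: the provider callables, the PermanentError marker, and the shape of the returned dict are stand-ins for illustration, not any gateway's actual API. Each callable is assumed to have already applied its own retries and breaker check before raising.

```python
from typing import Callable, List, Sequence, Tuple


class PermanentError(Exception):
    """Hypothetical marker for errors no other provider can fix (e.g. a 400)."""


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has been tried and failed."""


def call_with_fallback(
    providers: Sequence[Tuple[str, Callable[[dict], dict]]],
    payload: dict,
) -> dict:
    """Try each (name, call) pair in order and return the first success."""
    failed: List[Tuple[str, str]] = []
    for name, call in providers:
        try:
            result = call(payload)
        except PermanentError:
            raise  # a bad request fails everywhere; do not reroute it
        except Exception as exc:
            failed.append((name, repr(exc)))  # record the attempt, move on
            continue
        return {"provider": name, "result": result, "failed_attempts": failed}
    raise AllProvidersFailed(failed)
```

The failed_attempts list is the in-miniature version of the attempt log mentioned above: it shows which targets were tried before one succeeded.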
flo2 is a developer-first LLM gateway built around exactly this. Bring your own provider keys (OpenAI, Anthropic, Gemini, Groq, Cerebras, DeepInfra, Mistral, xAI, OpenRouter) and route every request through one OpenAI- and Anthropic-compatible key: you get fallback chains that auto-retry on 429/5xx and skip permanent 4xx errors, cooldowns on unhealthy providers, response caching for repeated requests, and true per-call cost accounting — all with zero token markup, since you pay the providers directly. It's the zero-markup OpenRouter alternative, and it's free during Beta.