2026-05-31 · flo2 blog

LLM Fallback and Racing: Build an LLM API That Never Goes Down

Your LLM app worked perfectly in the demo. Then a model provider had a bad afternoon — a regional outage, a rate-limit spike during a traffic surge, a cluster of 503s — and your product went down with it. If your application talks to exactly one model behind exactly one API key, you have built a single point of failure into the most load-bearing part of your stack. This is the problem LLM fallback solves, and it's the foundation of any LLM API that stays up when individual providers don't.

Why single-provider LLM apps break

Model APIs are remarkably good, but they are still remote services subject to the same failure modes as any other distributed system — plus a few of their own. In production you will eventually hit all of these:

Rate limits (429). You share capacity with everyone else on that provider. A burst of traffic — yours or theirs — pushes you over your tokens-per-minute or requests-per-minute ceiling, and requests start bouncing.
Server errors (500/502/503/529). Overloaded backends, deployments gone wrong, capacity crunches on a newly popular model. These are transient but real, and they cluster.
Regional and full outages. When a frontier lab has an incident, it tends to be correlated and total. No amount of retrying the same endpoint helps when the endpoint itself is dark.
Slow-tail latency. The median response is fine; the p99 is a disaster. A single slow request stuck behind a degraded node can blow your latency budget even when nothing technically "failed."

Each of these has a different correct response, which is exactly why a naive "wrap it in a try/catch and retry" loop is not enough. Retrying a 429 helps; retrying a 400 just wastes time and money. Surviving an outage requires leaving the provider entirely. To handle production traffic you need a small vocabulary of distinct techniques.

Retries vs. fallback vs. load balancing vs. racing

These four terms get used interchangeably, but they are different tools for different failures. Getting the distinction right is most of the battle.

Retries

Re-issuing the same request to the same model after a transient failure. Retries fix blips — a one-off 503, a momentary timeout. They do nothing for a sustained outage or a hard rate limit, and without a cap and backoff they make an overloaded provider worse. Retries are a within-target tactic, not a resilience strategy.

LLM fallback (failover)

When the primary target is exhausted — it has failed after N retries, or returned a terminal error — you move to a different model or provider. This is LLM failover, and it's what gets you through a full outage: GPT-class primary down, so you slide to a Claude or Gemini secondary, then to a self-hosted model as a last resort. Fallback changes the destination; retries do not.

LLM load balancing

Spreading requests proactively across multiple equivalent targets — several API keys, several regions, several providers serving the same model — to stay under per-key rate limits and smooth out capacity. Where fallback is reactive (move after failure), LLM load balancing is preventive (distribute before failure). The two compose well: balance across healthy targets, fall back when one degrades.
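As a rough illustration of the preventive side, here is a minimal round-robin sketch that rotates across equivalent targets and skips any that were recently rate-limited. The `Target` and `RoundRobinBalancer` names, the key labels, and the 30-second cooldown are all hypothetical choices for this example, not part of any provider SDK.

```python
from dataclasses import dataclass


@dataclass
class Target:
    name: str                     # hypothetical label, e.g. one API key or region
    cooldown_until: float = 0.0   # timestamp until which this target is skipped


class RoundRobinBalancer:
    """Rotate across equivalent targets, skipping any that are cooling down."""

    def __init__(self, targets):
        self.targets = list(targets)
        self._next = 0

    def pick(self, now):
        # Try each target at most once per call, starting where we left off.
        for _ in range(len(self.targets)):
            target = self.targets[self._next]
            self._next = (self._next + 1) % len(self.targets)
            if target.cooldown_until <= now:
                return target
        raise RuntimeError("no healthy target; hand off to the fallback chain")

    def mark_rate_limited(self, target, now, cooldown_s=30.0):
        # Assumed policy: bench a target for a fixed window after a 429.
        target.cooldown_until = now + cooldown_s
```

Passing `now` in explicitly (for example from `time.monotonic()`) keeps the sketch easy to test. A production balancer would more likely weight targets by remaining quota than rotate blindly.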

LLM racing (hedged requests)

Send the request to two or more models at once — optionally giving your preferred model a head start — and keep whichever responds first, aborting the rest. LLM racing is the only one of these four that attacks slow-tail latency directly, because it doesn't wait for a timeout to fire before trying an alternative. It trades money for speed and reliability.

Designing a fallback chain

A fallback chain is an ordered list of targets, tried top to bottom until one succeeds. Three decisions define a good one.

1. Ordering: cost and quality

Order is policy, not an afterthought. The common pattern is best-quality-first: your strongest model at the top, progressively cheaper or faster alternatives below as graceful degradation — a slightly worse answer beats an error page. The inverse, cheapest-first, treats premium models as an overflow valve you only pay for when the budget option is unavailable. Either is valid; what matters is that you chose deliberately and can reorder as prices and model quality shift.
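One way to keep ordering a deliberate, reversible choice is to derive the chain from data rather than hard-coding it. The sketch below assumes you maintain your own quality score and blended price per model; the `Model` fields, the policy names, and the sample numbers are illustrative, not real benchmarks or prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    quality: float        # hypothetical score from your own evals, higher is better
    usd_per_mtok: float   # hypothetical blended price per million tokens


def build_chain(models, policy):
    """Return the models in fallback order for the chosen policy."""
    if policy == "best-quality-first":
        return sorted(models, key=lambda m: m.quality, reverse=True)
    if policy == "cheapest-first":
        return sorted(models, key=lambda m: m.usd_per_mtok)
    raise ValueError(f"unknown policy: {policy}")


# Made-up catalogue for illustration only.
CATALOGUE = [
    Model("model-a", quality=0.92, usd_per_mtok=12.0),
    Model("model-b", quality=0.88, usd_per_mtok=4.0),
    Model("model-c", quality=0.75, usd_per_mtok=0.5),
]
```

Reordering the chain when prices or eval scores change then means updating the catalogue, not the routing code.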

2. Which errors are retryable vs. terminal

This is the single most important piece of logic in the whole system. Error classification decides whether you retry the same target, skip to the next one, or stop and surface the failure.

Retry the same target on transient errors: 429 (after backoff), 500, 502, 503, 529, and network timeouts. These can succeed on a second attempt.
Skip to the next target when retries are exhausted, or immediately on errors that signal this target is unusable: 401/403 (bad or revoked key), a persistent 429 that won't clear, or model-specific 404s.
Fail fast, do not fall back, on client errors that every provider will reject identically: 400 (malformed request), 422, content-policy refusals, or a context-length overflow. Falling through the entire chain on a 400 just repeats one bug against every target and burns latency for a result you can't change.
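The three rules above can be sketched as a single classification function. The status sets mirror the ones listed in this section; treating any unlisted status as "skip to the next target" and representing a network timeout as `None` are assumptions of this sketch, and real providers may signal content-policy refusals or context overflows with codes other than the ones shown.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"   # same target again, after backoff
    NEXT = "next"     # give up on this target, try the next one
    FAIL = "fail"     # surface the error to the caller


RETRY_ON = {429, 500, 502, 503, 529}
FAIL_ON = {400, 413, 422}
# 401, 403, 404 and anything unlisted fall through to NEXT (an assumption).


def classify(status, attempt, max_retries):
    """Decide what to do after a failed attempt.

    status      -- HTTP status code, or None for a network timeout
    attempt     -- 0-based index of the attempt that just failed
    max_retries -- retries allowed per target after the initial try
    """
    if status in FAIL_ON:
        return Action.FAIL
    if status is None or status in RETRY_ON:
        # Transient: retry while budget remains, otherwise move on.
        return Action.RETRY if attempt < max_retries else Action.NEXT
    return Action.NEXT
```

Keeping this logic in one pure function makes the policy easy to unit-test and to change without touching the transport code.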

3. Backoff and idempotency

Retries should use exponential backoff with jitter so a fleet of clients doesn't synchronize into a thundering herd against a recovering provider. Honor the Retry-After header on 429s when present. And assume any attempt may have had an effect even if it looked like a failure: if a fallback fires after the primary call partially succeeded (for example, it timed out on your side but completed on theirs), you still want at most one user-visible completion. Send an idempotency key where the provider supports one, or design the call so a duplicate is harmless.
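A minimal sketch of that delay calculation, using the "full jitter" variant (a uniform random wait between zero and the exponential ceiling). The base, cap, and function names are illustrative defaults, and this version only understands the delta-seconds form of Retry-After; the HTTP-date form is deliberately ignored here.

```python
import random


def parse_retry_after(value):
    """Return Retry-After as seconds, or None if absent or not plain seconds."""
    if value is None:
        return None
    value = value.strip()
    return float(value) if value.isdigit() else None


def backoff_delay(attempt, base=0.5, cap=30.0, retry_after=None, rng=random.random):
    """Seconds to wait before the retry that follows 0-based `attempt`."""
    if retry_after is not None:
        # The server told us when to come back; trust it over our own guess.
        return max(0.0, retry_after)
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: spread clients uniformly over [0, ceiling).
    return rng() * ceiling
```

With the defaults, the ceiling grows 0.5 s, 1 s, 2 s, 4 s and so on until it hits the 30 s cap. Injecting `rng` keeps the function deterministic under test.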

Conceptually, the routing logic is small:

route "chat-resilient":
  targets:
    - openai/gpt-5.1         # primary: best quality
    - anthropic/claude-opus  # fallback 1: different provider, no correlated outage
    - self-hosted/llama-70b  # fallback 2: last resort, always-on

  retry:       { max: 2, backoff: exponential, jitter: true }
  retry_on:    [429, 500, 502, 503, 529, timeout]
  next_on:     [401, 403, 404]   # this target is dead, move on
  fail_on:     [400, 422, 413]   # terminal, surface to caller

for target in targets:
    for attempt in 0..retry.max:              # inclusive: 1 initial try + retry.max retries
        resp = call(target, request)          # a timeout surfaces as status "timeout"
        if resp.ok:                     return resp
        if resp.status in fail_on:      raise resp    # terminal: don't fall through
        if resp.status not in retry_on: break         # next_on or unclassified: next target
        if attempt < retry.max:         sleep(backoff(attempt))
    # exhausted this target -> outer loop advances to next
raise AllTargetsFailed
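To exercise that loop without touching a real provider, it can be written as runnable Python with the transport injected. Everything named here (`Response`, `route`, `TerminalError`, the `call` signature) is a hypothetical stand-in for this sketch; timeouts are modeled as the string status "timeout" to match the config above.

```python
import time
from dataclasses import dataclass

RETRY_ON = {429, 500, 502, 503, 529, "timeout"}
FAIL_ON = {400, 413, 422}


@dataclass
class Response:
    status: object   # int HTTP status, or the string "timeout"
    body: str = ""

    @property
    def ok(self):
        return self.status == 200


class TerminalError(Exception):
    """A client error that no other target would accept either."""


class AllTargetsFailed(Exception):
    """Every target in the chain was tried and none succeeded."""


def route(targets, request, call, max_retries=2,
          backoff=lambda attempt: 0.5 * 2 ** attempt, sleep=time.sleep):
    """Try targets in order; return (target, response) for the first success."""
    for target in targets:
        for attempt in range(max_retries + 1):   # initial try + retries
            resp = call(target, request)
            if resp.ok:
                return target, resp
            if resp.status in FAIL_ON:
                raise TerminalError(resp.status)  # don't fall through
            if resp.status not in RETRY_ON:
                break                             # target unusable: next one
            if attempt < max_retries:
                sleep(backoff(attempt))
    raise AllTargetsFailed(list(targets))
```

Passing a scripted fake for `call` and a no-op for `sleep` lets you assert the exact sequence of attempts for any failure scenario before wiring in real clients.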

Racing and hedged requests

Fallback is sequential: you pay the cost of waiting for target one to fail before trying target two. For latency-sensitive paths, racing flips that to parallel. The mechanics that matter:

Head start. Fire your preferred (cheaper or higher-quality) model first, and launch the hedge only after a short delay: say 300–700 ms, or once the primary has been in flight longer than your usual p95 latency. If the primary answers within the head-start window, you never pay for a second call at all. Hedging only the slow tail is what keeps the cost sane.
First-token-wins for streaming. In a streaming UI, the moment any racer emits its first token, commit to that stream and abort the others. This is the most effective fix for TTFT (time to first token) variance, because you're no longer hostage to whichever node happened to be slow.
Abort the losers. As soon as you have a winner, cancel the in-flight duplicates so you stop paying for tokens you'll discard. Without disciplined cancellation, a 2-way race can nearly double spend on every request.
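The head-start and abort-the-losers mechanics can be sketched with asyncio for the non-streaming case. The `race` function and its parameters are hypothetical; `primary` and `hedge` are assumed to be zero-argument callables that return coroutines, and launching the hedge immediately when the primary fails early is a design choice of this sketch. First-token-wins for streams would need the same structure applied to the first chunk rather than the full response.

```python
import asyncio


async def race(primary, hedge, head_start_s):
    """Run primary; start hedge only if primary is slow or fails. First success wins."""
    first = asyncio.create_task(primary())
    done, _ = await asyncio.wait({first}, timeout=head_start_s)
    if done and not first.cancelled() and first.exception() is None:
        return first.result()          # answered inside the window: no hedge cost

    pending = {first, asyncio.create_task(hedge())}
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if not task.cancelled() and task.exception() is None:
                    return task.result()
        raise RuntimeError("every racer failed")
    finally:
        # Abort the losers so they stop generating tokens we would discard.
        for task in pending:
            task.cancel()
        await asyncio.gather(*pending, return_exceptions=True)
```

Cancelling the local task only saves money if the underlying HTTP client actually closes the connection on cancellation, so that is worth verifying for whichever client you use.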

Use racing where reliability and latency justify the duplicate-call cost — interactive chat, real-time agents, anything user-facing with a tight TTFT target. Don't race large batch jobs or background pipelines, where doubling token spend to shave tail latency is a bad trade.

Observability: you can't fix what you can't see

The instant you add fallback and racing, a single request becomes a tree of attempts — and a plain "200 OK at the edge" hides everything underneath. Log every attempt, not just the winner. For each one capture: target model and key, HTTP status and error class, retry number, the decision taken (retry / next / fail), TTFT, total latency, input and output tokens, and the computed cost.
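A per-attempt record along those lines might look like the sketch below. The field names, the `decision` vocabulary, and the per-million-token pricing inputs are assumptions for illustration; use whatever schema your logging pipeline already expects.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Attempt:
    request_id: str            # ties all attempts for one user request together
    target: str                # model identifier
    key_id: str                # which API key was used (an id, never the secret)
    status: Optional[int]      # HTTP status, None for a network timeout
    error_class: str           # e.g. "rate_limit", "server_error", "" on success
    retry_number: int
    decision: str              # "success" | "retry" | "next" | "fail" | "aborted"
    ttft_ms: Optional[float]   # None if no token was ever received
    total_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float


def attempt_cost(input_tokens, output_tokens, usd_per_mtok_in, usd_per_mtok_out):
    """Cost of one attempt given per-million-token prices (assumed pricing model)."""
    return (input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out) / 1_000_000


def to_log_line(attempt):
    """Serialize one attempt as a single JSON line for structured logging."""
    return json.dumps(asdict(attempt), sort_keys=True)
```

Emitting one line per attempt, all sharing a `request_id`, is what lets you reconstruct the full tree of attempts behind a single edge response later.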

With that data you can answer the questions that actually run a production LLM system: How often is my primary failing, and with which errors? What is my true blended cost once fallbacks and aborted races are counted? Is fallback target two quietly carrying 30% of traffic because the primary is chronically rate-limited? Which model wins races, and is the head-start tuned right? Without per-attempt accounting, fallback can mask a degrading provider until your bill or your latency graph forces the issue.
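Given per-attempt records, several of those questions reduce to a small aggregation. This sketch assumes each record is a dict with `request_id`, `target`, `status`, `decision`, and `cost_usd` keys, and that `decision` uses "success" for the winning attempt and "aborted" for cancelled racers; those names are hypothetical.

```python
from collections import Counter, defaultdict


def summarize(attempts, primary):
    """Aggregate per-attempt records into per-request routing and cost stats."""
    by_request = defaultdict(list)
    for a in attempts:
        by_request[a["request_id"]].append(a)
    if not by_request:
        return {"served_share": {}, "primary_errors": {}, "cost_per_request": 0.0}

    served_by = Counter()
    total_cost = 0.0
    for rows in by_request.values():
        # Blended cost counts every attempt, including failures and aborted racers.
        total_cost += sum(r["cost_usd"] for r in rows)
        winners = [r for r in rows if r["decision"] == "success"]
        served_by[winners[0]["target"] if winners else "none"] += 1

    # Which errors is the primary actually returning?
    primary_errors = Counter(
        a["status"] for a in attempts
        if a["target"] == primary and a["decision"] not in ("success", "aborted")
    )

    n = len(by_request)
    return {
        "served_share": {t: c / n for t, c in served_by.items()},
        "primary_errors": dict(primary_errors),
        "cost_per_request": total_cost / n,
    }
```

A rising `served_share` for a fallback target alongside a growing 429 count in `primary_errors` is the signature of a primary that is chronically rate-limited while the edge still reports success.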

Pitfalls to design around

Double-charging. A primary that succeeds slowly while a hedge also completes means you paid twice. Aggressive cancellation and idempotency keys are the defense.
Non-deterministic outputs. Two models — or two runs of one model — give different answers. If anything downstream caches by request or expects a stable schema, pin response formats (JSON mode, tool schemas) so a fallback target's output stays parseable.
Streaming plus fallback is hard. Once you've streamed tokens to the client, you can't un-send them. If the primary fails mid-stream, you generally can't transparently fall back without restarting the response. Either fall back only before the first token, or build explicit replace-the-stream handling on the client.
Correlated failures. Three fallback targets on the same provider, or the same cloud region, share fate. Real failover needs genuine diversity across providers and infrastructure.
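For the streaming pitfall above, the simpler of the two options (fall back only before the first token) can be sketched with plain generators. `stream_with_fallback` and `open_stream` are hypothetical names; `open_stream(target)` is assumed to return an iterable of tokens and to raise if the target cannot be reached.

```python
def stream_with_fallback(targets, open_stream):
    """Yield tokens from the first target that manages to produce a first token.

    Failures before the first token move on to the next target. Once a token
    has been yielded we are committed: later errors propagate to the caller.
    """
    last_error = None
    for target in targets:
        try:
            stream = iter(open_stream(target))
            first = next(stream)
        except StopIteration:
            # An empty stream is treated as a failed target in this sketch.
            last_error = RuntimeError(f"{target}: empty stream")
            continue
        except Exception as exc:
            last_error = exc
            continue

        yield first          # point of no return: the client has seen output
        yield from stream    # mid-stream errors are deliberately not caught
        return

    raise RuntimeError("all targets failed before the first token") from last_error
```

A mid-stream failure still surfaces as an exception to the caller, which is where any replace-the-stream handling on the client would have to take over.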

You don't have to build this in your app

Every technique above — error classification, backoff, fallback chains, load balancing, racing, per-attempt cost and TTFT accounting — is plumbing, not product. Implementing it inside your application code means reinventing a distributed-systems problem in every service that calls a model, and re-testing it on every provider change. This is exactly the job of an LLM router or gateway: it sits between your app and the providers, owns the retry-and-fallback policy as configuration, and exposes a single stable endpoint. Your code makes one call; the gateway decides where it actually goes.

flo2 is a developer-first LLM gateway built around exactly this: drag-to-reorder fallback chains with retry-on-429/5xx and skip-on-4xx classification, AI racing with a configurable head start and first-token-wins, and true per-attempt cost accounting — behind one OpenAI- and Anthropic-compatible key, with zero markup on your own provider keys. You get an LLM API that routes around failure without touching your application code — and it's free during Beta.
