2026-05-31 · flo2 blog

LLM Fallback and Racing: Build an LLM API That Never Goes Down

Your LLM app worked perfectly in the demo. Then a model provider had a bad afternoon — a regional outage, a rate-limit spike during a traffic surge, a cluster of 503s — and your product went down with it. If your application talks to exactly one model behind exactly one API key, you have built a single point of failure into the most load-bearing part of your stack. This is the problem LLM fallback solves, and it's the foundation of any LLM API that stays up when individual providers don't.

Why single-provider LLM apps break

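Model APIs are remarkably good, but they are still remote services subject to the same failure modes as any other distributed system, plus a few of their own. In production you will eventually hit rate limits (429s), transient server errors such as 503s, timeouts, rejected requests (400s), and full provider outages.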

Each of these has a different correct response, which is exactly why a naive "wrap it in a try/catch and retry" loop is not enough. Retrying a 429 helps; retrying a 400 just wastes time and money. Surviving an outage requires leaving the provider entirely. To handle production traffic you need a small vocabulary of distinct techniques.

Retries vs. fallback vs. load balancing vs. racing

These four terms get used interchangeably, but they are different tools for different failures. Getting the distinction right is most of the battle.

Retries

Re-issuing the same request to the same model after a transient failure. Retries fix blips — a one-off 503, a momentary timeout. They do nothing for a sustained outage or a hard rate limit, and without a cap and backoff they make an overloaded provider worse. Retries are a within-target tactic, not a resilience strategy.

LLM fallback (failover)

When the primary target is exhausted — it has failed after N retries, or returned a terminal error — you move to a different model or provider. This is LLM failover, and it's what gets you through a full outage: GPT-class primary down, so you slide to a Claude or Gemini secondary, then to a self-hosted model as a last resort. Fallback changes the destination; retries do not.

LLM load balancing

Spreading requests proactively across multiple equivalent targets — several API keys, several regions, several providers serving the same model — to stay under per-key rate limits and smooth out capacity. Where fallback is reactive (move after failure), LLM load balancing is preventive (distribute before failure). The two compose well: balance across healthy targets, fall back when one degrades.
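As a rough illustration of the preventive side, here is a minimal round-robin sketch in Python. The Target class, its healthy flag, and the target names are hypothetical; a real balancer would also weigh remaining rate-limit budget and update health from live traffic.

```python
import itertools
from dataclasses import dataclass
from typing import List


@dataclass
class Target:
    name: str              # hypothetical label, e.g. "key-a/us-east"
    healthy: bool = True   # assumed to be maintained by a health checker


class RoundRobinBalancer:
    """Rotate through equivalent targets, skipping any marked unhealthy."""

    def __init__(self, targets: List[Target]) -> None:
        if not targets:
            raise ValueError("need at least one target")
        self._targets = targets
        self._cycle = itertools.cycle(targets)

    def pick(self) -> Target:
        # Inspect each target at most once per call so this cannot spin forever.
        for _ in range(len(self._targets)):
            candidate = next(self._cycle)
            if candidate.healthy:
                return candidate
        raise RuntimeError("no healthy targets left; time to fall back")
```

When pick() raises, every equivalent target is degraded, which is the point where the reactive fallback chain takes over.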

LLM racing (hedged requests)

Send the request to two or more models at once — optionally giving your preferred model a head start — and keep whichever responds first, aborting the rest. LLM racing is the only one of these four that attacks slow-tail latency directly, because it doesn't wait for a timeout to fire before trying an alternative. It trades money for speed and reliability.

Designing a fallback chain

A fallback chain is an ordered list of targets, tried top to bottom until one succeeds. Three decisions define a good one.

1. Ordering: cost and quality

Order is policy, not an afterthought. The common pattern is best-quality-first: your strongest model at the top, progressively cheaper or faster alternatives below as graceful degradation — a slightly worse answer beats an error page. The inverse, cheapest-first, treats premium models as an overflow valve you only pay for when the budget option is unavailable. Either is valid; what matters is that you chose deliberately and can reorder as prices and model quality shift.

2. Which errors are retryable vs. terminal

This is the single most important piece of logic in the whole system. Error classification decides whether you retry the same target, skip to the next one, or stop and surface the failure.
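One way to express that three-way decision is a small pure function. The sketch below is illustrative: the status sets mirror the ones used in the route sketch later in this post, and sending unclassified statuses to the next target is an assumption, not a rule from any provider.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    RETRY = "retry"   # same target again, after backoff
    NEXT = "next"     # give up on this target, try the next one
    FAIL = "fail"     # terminal: surface the error to the caller


# Illustrative sets; tune them to each provider's documented error codes.
RETRY_ON = {429, 500, 502, 503, 529}
NEXT_ON = {401, 403, 404}
FAIL_ON = {400, 413, 422}


def classify(status: Optional[int], timed_out: bool = False) -> Decision:
    """Map one failed attempt to retry, next, or fail."""
    if timed_out or status in RETRY_ON:
        return Decision.RETRY
    if status in FAIL_ON:
        return Decision.FAIL
    # NEXT_ON, plus anything unclassified (assumption: treat as a dead target).
    return Decision.NEXT
```

Keeping the classification in one function means the policy can be tested without making a single network call.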

3. Backoff and idempotency

Retries should use exponential backoff with jitter so a fleet of clients doesn't synchronize into a thundering herd against a recovering provider. Honor the Retry-After header on 429s when present. And treat every attempt as potentially observable: if a fallback fires after a primary call may have partially succeeded, you want at-most-one user-visible completion — send an idempotency key or design the call so a duplicate is harmless.
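A minimal sketch of the delay calculation, assuming the "full jitter" variant of exponential backoff and a Retry-After value given in seconds. The function name and the base and cap defaults are placeholders, not recommendations.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 0.5,
    cap: float = 30.0,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # A numeric Retry-After header takes priority over our own schedule.
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # the HTTP-date form is not parsed in this sketch
    # Full jitter: pick uniformly in [0, min(cap, base * 2**attempt)] so a
    # fleet of clients does not retry in lockstep.
    return random.uniform(0.0, min(cap, base * 2 ** attempt))
```

If a provider asks for a longer wait than the request's latency budget allows, moving to the next target is usually a better use of that time than sleeping.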

Conceptually, the routing logic is small:

route "chat-resilient":
  targets:
    - openai/gpt-5.1         # primary: best quality
    - anthropic/claude-opus  # fallback 1: different provider, outages less likely to correlate
    - self-hosted/llama-70b  # fallback 2: last resort, always-on

  retry:    { max: 2, backoff: exponential, jitter: true }
  retry_on: [429, 500, 502, 503, 529, timeout]
  next_on:  [401, 403, 404]   # this target is dead, move on
  fail_on:  [400, 413, 422]   # terminal, surface to caller

for target in targets:
    for attempt in 0..=retry.max:                 # first try plus retry.max retries
        resp = call(target, request)
        if resp.ok:                     return resp
        if resp.status in fail_on:      raise resp    # don't fall through
        if resp.status not in retry_on: break         # next_on or unclassified: next target
        if attempt < retry.max:         sleep(backoff(attempt))   # no sleep after the last attempt
    # exhausted this target -> outer loop advances to next
raise AllTargetsFailed
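The routing pseudocode above can be turned into a runnable Python sketch along these lines. Everything here is hypothetical scaffolding: the Response type, the injected call function, and the choice to count one first try plus max_retries retries are assumptions, and jitter is left out of the default backoff to keep the example deterministic.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

RETRY_ON = {429, 500, 502, 503, 529}
FAIL_ON = {400, 413, 422}


@dataclass
class Response:
    status: Optional[int]   # None stands in for a timeout
    body: str = ""

    @property
    def ok(self) -> bool:
        return self.status is not None and 200 <= self.status < 300


class TerminalError(Exception):
    """The request itself is bad; no other target would accept it."""


class AllTargetsFailed(Exception):
    """Every target in the chain was tried and none succeeded."""


def route(
    targets: Sequence[str],
    call: Callable[[str], Response],
    max_retries: int = 2,
    backoff: Callable[[int], float] = lambda attempt: 0.5 * 2 ** attempt,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    for target in targets:
        for attempt in range(max_retries + 1):   # first try plus retries
            resp = call(target)
            if resp.ok:
                return resp
            if resp.status in FAIL_ON:
                raise TerminalError(f"{target} returned {resp.status}")
            retryable = resp.status is None or resp.status in RETRY_ON
            if not retryable:
                break                            # dead target: move on
            if attempt < max_retries:
                sleep(backoff(attempt))          # skip the wait after the last try
    raise AllTargetsFailed(list(targets))
```

Passing call and sleep in as parameters keeps the policy testable: a fake call can simulate a 503 storm on the primary and the chain can be verified without real providers or real waiting.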

Racing and hedged requests

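Fallback is sequential: you pay the cost of waiting for target one to fail before trying target two. For latency-sensitive paths, racing flips that to parallel. The mechanics that matter are the head start you give the preferred model, the rule that the first response wins, and aborting the rest.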

Use racing where reliability and latency justify the duplicate-call cost — interactive chat, real-time agents, anything user-facing with a tight TTFT target. Don't race large batch jobs or background pipelines, where doubling token spend to shave tail latency is a bad trade.
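A hedged request with a head start can be sketched with asyncio as follows. The race function and its parameters are hypothetical, and the two callables stand in for real model calls. Cancelling a task only stops waiting on our side; whether the provider stops generating (and billing) depends on the HTTP client and the provider.

```python
import asyncio
from typing import Awaitable, Callable


async def race(
    preferred: Callable[[], Awaitable[str]],
    backup: Callable[[], Awaitable[str]],
    head_start: float = 0.3,
) -> str:
    """Return the first successful result and cancel whichever call loses."""
    first = asyncio.create_task(preferred())
    tasks = {first}
    try:
        # Head start: the backup is launched only if the preferred call has
        # not finished successfully within `head_start` seconds.
        done, pending = await asyncio.wait(tasks, timeout=head_start)
        if first in done and first.exception() is None:
            return first.result()
        last_error = first.exception() if first in done else None
        tasks = set(pending) | {asyncio.create_task(backup())}
        while tasks:
            done, tasks = await asyncio.wait(
                tasks, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return task.result()
                last_error = task.exception()
        raise last_error   # every racer failed
    finally:
        for task in tasks:
            task.cancel()  # abort whatever is still running
```

A failed racer does not end the race here; the function keeps waiting for the other one, so racing also covers the case where the preferred model errors out quickly.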

Observability: you can't fix what you can't see

The instant you add fallback and racing, a single request becomes a tree of attempts — and a plain "200 OK at the edge" hides everything underneath. Log every attempt, not just the winner. For each one capture: target model and key, HTTP status and error class, retry number, the decision taken (retry / next / fail), TTFT, total latency, input and output tokens, and the computed cost.

With that data you can answer the questions that actually run a production LLM system: How often is my primary failing, and with which errors? What is my true blended cost once fallbacks and aborted races are counted? Is fallback target two quietly carrying 30% of traffic because the primary is chronically rate-limited? Which model wins races, and is the head-start tuned right? Without per-attempt accounting, fallback can mask a degrading provider until your bill or your latency graph forces the issue.
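To make the per-attempt idea concrete, here is a minimal sketch of an attempt record and two of the questions above answered from it. The field names, the "win" and "aborted" decision labels, and the helper functions are illustrative assumptions, not a fixed schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Attempt:
    request_id: str
    target: str                # model plus key label
    status: Optional[int]      # None for a timeout or an aborted race
    decision: str              # "retry", "next", "fail", "win", or "aborted"
    retry_number: int
    ttft_ms: Optional[float]   # None if no token ever arrived
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float


def blended_cost_per_request(attempts: List[Attempt]) -> float:
    """Total spend, losers included, divided by distinct user requests."""
    requests = {a.request_id for a in attempts}
    return sum(a.cost_usd for a in attempts) / len(requests)


def win_share_by_target(attempts: List[Attempt]) -> Dict[str, float]:
    """Fraction of successful responses served by each target."""
    wins = Counter(a.target for a in attempts if a.decision == "win")
    total = sum(wins.values())
    return {target: count / total for target, count in wins.items()}
```

A rising win share for a fallback target is the early signal that the primary is degrading, well before it shows up as an error rate at the edge.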

Pitfalls to design around

You don't have to build this in your app

Every technique above — error classification, backoff, fallback chains, load balancing, racing, per-attempt cost and TTFT accounting — is plumbing, not product. Implementing it inside your application code means reinventing a distributed-systems problem in every service that calls a model, and re-testing it on every provider change. This is exactly the job of an LLM router or gateway: it sits between your app and the providers, owns the retry-and-fallback policy as configuration, and exposes a single stable endpoint. Your code makes one call; the gateway decides where it actually goes.

flo2 is a developer-first LLM gateway built around exactly this: drag-to-reorder fallback chains with retry-on-429/5xx and skip-on-4xx classification, AI racing with a configurable head start and first-token-wins, and true per-attempt cost accounting — behind one OpenAI- and Anthropic-compatible key, with zero markup on your own provider keys. You get an LLM API that routes around failure without touching your application code — and it's free during Beta.

© 2026 flo2.com, the zero-markup LLM gateway & router.