2026-06-03 · flo2 blog

LLM Load Balancing: Spread Requests Across Keys & Providers

One API key has one quota. The moment your real traffic outgrows it, no amount of polite retrying conjures more capacity — you're just queuing against a wall. LLM load balancing is how you get past that wall: instead of funneling every request through a single key, you spread the load across multiple keys of one provider, across several providers serving an equivalent model, or both. Done well, it raises your effective rate limit, lifts throughput, improves reliability, and gives you a lever to optimize cost — without rewriting the application that makes the call.

Why LLM load balancing matters

Every major provider caps how fast you can send. Distributing requests is the most direct way around that wall, and it buys four things at once:

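Work around per-key rate limits. A single key carries one RPM (requests per minute) and one TPM (tokens per minute) budget. Spread requests across multiple API keys that each draw on their own quota and your effective ceiling becomes the sum of their limits rather than any one key's, which is the cleanest way to stop hitting 429 on a busy endpoint. Check how your provider scopes its limits first: keys issued under the same account or project often share a single budget.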
Raise throughput. Throughput is roughly concurrency divided by per-request duration, and each key and provider gives you an independent slice of concurrency. Balancing work across targets pushes more tokens per minute than any single backend allows.
Improve reliability. One key on one provider is a single point of failure — a bad deploy, a regional incident, or a revoked key takes you fully down. Spread across independent targets, one degrading backend costs you a fraction of capacity, not all of it.
Optimize cost. Once a request can land on more than one target, which target becomes a cost decision. The same open model is often priced differently across providers, so a policy can prefer the cheapest healthy option and spill to pricier ones only under pressure.

These goals overlap but pull in different directions: rate-limit relief and throughput argue for many keys on one provider, reliability for diversity across providers, and cost for ranking by price. A single distribution layer can serve all four.
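To put rough numbers on the first two points, here is a minimal sketch. The key names and limits are made up for illustration, and the pooled figure only holds if every key really has its own quota. The throughput helper is Little's law rearranged, a steady-state approximation rather than a guarantee.

```python
# Hypothetical limits: (requests per minute, tokens per minute) per key.
KEY_LIMITS = {
    "key-a": (500, 200_000),
    "key-b": (500, 200_000),
    "key-c": (10_000, 2_000_000),
}

def pooled_ceiling(limits):
    """Sum RPM and TPM across keys, assuming every key has its own quota."""
    rpm = sum(r for r, _ in limits.values())
    tpm = sum(t for _, t in limits.values())
    return rpm, tpm

def steady_state_rps(concurrency, seconds_per_request):
    """Little's law rearranged: throughput = requests in flight / time per request."""
    return concurrency / seconds_per_request

pooled_rpm, pooled_tpm = pooled_ceiling(KEY_LIMITS)

# Eight concurrent requests at four seconds each, on one target and on three.
one_target = steady_state_rps(concurrency=8, seconds_per_request=4.0)
three_targets = 3 * one_target
```

Under these assumptions the pool allows 11,000 RPM, and tripling the independent concurrency slices triples steady-state throughput from 2 to 6 requests per second.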

Load balancing strategies for LLM traffic

"Spread the load" hides several algorithms, each with a different tradeoff.

Round-robin across keys

The baseline. Keep an ordered list of targets and hand each request to the next, wrapping around. Round-robin needs no state beyond a cursor, is trivial to reason about, and works well when targets are interchangeable. But it's blind: it gives a near-exhausted key as much traffic as a fresh one, and a slow provider as much as a fast one. It is the right default and the wrong final answer for heterogeneous fleets.
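A minimal round-robin sketch might look like the following. The class name and key labels are illustrative, and the lock is there on the assumption that several threads share one balancer.

```python
import itertools
import threading

class RoundRobin:
    """Hands out targets in a fixed rotation; the only state is the cursor."""

    def __init__(self, targets):
        targets = list(targets)
        if not targets:
            raise ValueError("need at least one target")
        self._cycle = itertools.cycle(targets)
        self._lock = threading.Lock()  # serialize access to the shared iterator

    def pick(self):
        with self._lock:
            return next(self._cycle)

rr = RoundRobin(["key-a", "key-b", "key-c"])  # hypothetical key labels
order = [rr.pick() for _ in range(5)]
```

Five picks walk the list once and wrap around, regardless of how busy or slow any of the three keys happens to be.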

Weighted distribution

Assign each target a weight and route in proportion — the workhorse, because keys and providers are rarely equal. A key on a high usage tier with a 10,000 RPM ceiling should carry far more than a sandbox key capped at 500; give them weights of 20 and 1 and traffic splits accordingly. Weights also express preference: put 80% on your cheapest provider and 20% on a backup, and you've encoded a cost policy as a distribution. Tune them to each target's real rate limit, latency, and price.
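One way to derive weights like these is to scale each target's rate limit by the smallest one. The sketch below assumes RPM is the only input; the key names are hypothetical, and a real policy would likely fold in latency and price as well.

```python
import random
from collections import Counter

# Hypothetical key names; the 10,000 and 500 RPM figures mirror the example above.
rpm_limits = {"high-tier-key": 10_000, "sandbox-key": 500}

def weights_from_limits(limits):
    """Scale limits so the smallest one gets weight 1."""
    smallest = min(limits.values())
    return {name: limit / smallest for name, limit in limits.items()}

def pick_weighted(weights, rng=random):
    """Pick one target with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

weights = weights_from_limits(rpm_limits)

# Simulate 21,000 requests with a seeded generator to see the split.
rng = random.Random(0)
tally = Counter(pick_weighted(weights, rng) for _ in range(21_000))
```

With weights of 20 and 1, the sandbox key should see roughly one request in twenty-one. Weighted random selection only approximates the ratio over many requests, which is usually fine at LLM traffic volumes.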

Least-in-flight

Instead of a fixed ratio, route each request to whichever target has the fewest outstanding. This adapts in real time: a provider that slows down accumulates in-flight requests and automatically gets less new work, while a fast one drains its queue and pulls more. It shines exactly where round-robin fails — when per-request latency varies a lot, which for LLMs it always does (a 50-token reply and a 4,000-token reply are wildly different durations). It needs a live count per target, but tracks real capacity instead of assuming it.
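A least-in-flight picker can be sketched as a counter per target wrapped in a context manager, so the count drops even when a call raises. The class and target names are illustrative, and ties go to whichever target was listed first.

```python
import threading
from contextlib import contextmanager

class LeastInFlight:
    """Routes each request to the target with the fewest outstanding calls."""

    def __init__(self, target_ids):
        self._in_flight = {t: 0 for t in target_ids}
        self._lock = threading.Lock()

    @contextmanager
    def acquire(self):
        with self._lock:
            # min() returns the first minimal entry, so ties follow list order.
            target = min(self._in_flight, key=self._in_flight.get)
            self._in_flight[target] += 1
        try:
            yield target
        finally:
            with self._lock:
                self._in_flight[target] -= 1

    def snapshot(self):
        with self._lock:
            return dict(self._in_flight)

lb = LeastInFlight(["provider-a", "provider-b"])  # hypothetical targets
with lb.acquire() as first:        # both idle: the first listed wins
    with lb.acquire() as second:   # provider-a is busy, so provider-b is chosen
        busy = lb.snapshot()
idle = lb.snapshot()
```

In real use the provider call goes inside the `with` block; a slow provider then holds its counts longer and naturally receives less new work.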

Cost-aware routing

Rank targets by price and prefer the cheapest healthy one, escalating only when it's saturated or unavailable. A cheap provider serving an open model handles the bulk; the premium option becomes an overflow valve you pay for only when you must. Bias selection toward low price, then let failover handle the case where the cheap target can't take the request.
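As a rough sketch of that ranking, the picker below prefers the cheapest target that is both healthy and under a concurrency cap. The prices, names, and the `max_in_flight` saturation rule are assumptions made for the example, not real provider figures.

```python
from dataclasses import dataclass

@dataclass
class Target:
    name: str
    usd_per_million_tokens: float  # hypothetical price
    healthy: bool = True
    in_flight: int = 0
    max_in_flight: int = 8         # assumed saturation point

    def available(self):
        return self.healthy and self.in_flight < self.max_in_flight

def pick_cheapest(targets):
    """Return the cheapest available target, or None if none can take work."""
    candidates = [t for t in targets if t.available()]
    if not candidates:
        return None
    return min(candidates, key=lambda t: t.usd_per_million_tokens)

pool = [Target("budget-host", 0.20), Target("premium-host", 0.90)]

normal = pick_cheapest(pool).name          # the cheap target takes the bulk
pool[0].in_flight = pool[0].max_in_flight  # saturate the cheap target
overflow = pick_cheapest(pool).name        # spill to the premium target
pool[1].healthy = False
nothing = pick_cheapest(pool)              # nobody left: the caller must queue or fail
```

A strict cheapest-first rule sends all traffic to one target until it saturates. Blending price into weights gives a softer bias if that is too abrupt.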

In production these layer rather than compete: weights set the baseline split, least-in-flight adapts to live latency, cost-awareness biases the ranking, and health checks remove dead targets entirely.

Handling 429 and 5xx with automatic failover

Load balancing decides where a request goes first; failover decides what happens when that choice fails. You need both — distribution prevents most overload proactively, failover catches the requests that slip through. Without it, a balanced system still errors whenever the selected target is the one being throttled. The logic hinges on classifying the response, because different failures demand different responses:

Retry the same target on transient errors — 429 (after backoff), 500, 502, 503, 529, timeouts. A blip often clears on a second attempt.
Move to the next target when retries are exhausted, or immediately on a signal this target is unusable: 401/403 (bad or revoked key), or a 429 that won't clear. A 429 from one provider says nothing about another, so failover across providers turns a hard rate-limit wall into a soft, transparent reroute.
Fail fast, do not fall through, on client errors every target rejects identically: 400, 422, or an oversized request (413). Context-length overflow belongs here too, though many providers report it as a 400 rather than a 413. Walking the whole pool on a 400 multiplies one bug across every key and burns latency for a result you can't change.

Pair this with exponential backoff and jitter on the retryable cases so a fleet of clients doesn't synchronize into a thundering herd against a recovering provider, and honor the Retry-After header on 429s when present. For the full error taxonomy, see fixing LLM 429 errors.
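The classification and backoff rules can be sketched as two small functions. The status sets follow the lists above; the "full jitter" delay and the choice to treat a numeric Retry-After as a floor are assumptions of this sketch, and a timeout would need to be mapped by the caller since it has no status code.

```python
import random

RETRYABLE = {429, 500, 502, 503, 529}   # transient: retry the same target
SKIP_TARGET = {401, 403}                # bad or revoked key: move on at once
FATAL = {400, 413, 422}                 # client error: every target would reject it

def classify(status, attempt, max_retries):
    """Return 'retry', 'next_target', or 'fail' for a non-2xx status."""
    if status in FATAL:
        return "fail"
    if status in SKIP_TARGET:
        return "next_target"
    if status in RETRYABLE and attempt < max_retries:
        return "retry"
    return "next_target"                # retries exhausted, or an unknown error

def backoff_delay(attempt, retry_after=None, base=0.5, cap=30.0, rng=random):
    """Full-jitter exponential backoff; a numeric Retry-After acts as a floor."""
    delay = rng.uniform(0, min(cap, base * 2 ** attempt))
    if retry_after is not None:
        try:
            delay = max(delay, float(retry_after))
        except ValueError:
            pass  # Retry-After can also be an HTTP date; this sketch ignores that form
    return delay
```

Drawing the delay uniformly from zero up to the exponential bound spreads clients out more than adding a small random offset to a fixed schedule, which is the point of jitter here.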

Statelessness vs. sticky sessions

Does the balancer remember anything between requests? For most LLM traffic the answer is a firm stateless, and that's a feature. Chat-completion and message endpoints are generally stateless on the provider side — you send the full conversation history each turn, so request N+1 has no dependency on which target served request N. You're free to route every request independently by weight, in-flight count, or cost. That's exactly what makes round-robin and least-in-flight work: any target can serve any request.

Sticky sessions, which pin a conversation to one target, are right only in narrower cases. The common one is provider-side prompt caching: some providers cache a large shared prefix (a long system prompt, a big document) and discount requests that reuse it, but the cache lives on one provider's infrastructure. Bouncing across providers throws away the hit, so pin the session to the provider holding the warm prefix while still balancing across multiple keys within it, provided those keys share the account the cache is scoped to. The other case is a stateful API where the provider stores the conversation itself as a server-side thread, which only that provider can continue. Default to stateless routing; reach for stickiness only to protect a cache hit or a provider-held thread.
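One possible way to pin a session to a provider without storing any per-session state is to hash the session ID, then rotate keys inside the chosen provider. The pool layout and function names below are hypothetical, and simple modulo hashing remaps sessions whenever the provider list changes.

```python
import hashlib
import itertools

# Hypothetical pool: two providers, several keys each.
KEYS_BY_PROVIDER = {
    "provider-a": ["a-key-1", "a-key-2"],
    "provider-b": ["b-key-1", "b-key-2", "b-key-3"],
}
_key_cycles = {p: itertools.cycle(keys) for p, keys in KEYS_BY_PROVIDER.items()}

def provider_for(session_id):
    """Map a session to one provider, stably across processes and restarts."""
    providers = sorted(KEYS_BY_PROVIDER)
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return providers[int.from_bytes(digest[:8], "big") % len(providers)]

def route(session_id):
    """Sticky on the provider, round-robin across that provider's keys."""
    provider = provider_for(session_id)
    return provider, next(_key_cycles[provider])

first_provider, first_key = route("chat-42")
second_provider, second_key = route("chat-42")
```

Both calls land on the same provider but on different keys, so the warm prefix stays reachable while load still spreads across that provider's keys.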

Health checks and cooldown on a failing key

A static target list rots. A key gets rate-limited for a sustained stretch, a provider has a bad hour, a key gets revoked — and a naive balancer keeps routing its share into the failure. Make pool membership dynamic instead: track each target's recent health and pull unhealthy ones out until they recover.

Track recent outcomes per target. Keep a short rolling view of successes and failures. A spike in 429s or 5xxs, or a sustained latency climb, signals a degraded target.
Trip a cooldown (circuit breaker). When a target crosses a failure threshold, mark it unhealthy and stop sending it requests for a cooldown window — start at a few seconds, back off longer if it keeps failing. If a 429 carried Retry-After, use that value directly.
Probe before fully restoring. When the cooldown elapses, let a trickle through (half-open) before returning the target to full weight. Succeed and it's healthy again; fail and the cooldown extends. This stops a still-broken backend from flapping back into rotation.

Cooldowns turn a degrading provider from a recurring source of user-facing errors into a brief, self-healing dip in capacity — and they keep failover cheap, since a target in cooldown is skipped instantly instead of tried and fallen through on every request.
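The three steps can be sketched as a small per-target breaker. For brevity it counts consecutive failures instead of keeping a rolling window, and the thresholds, the doubling cooldown, and the injectable clock are all assumptions of this sketch.

```python
import time

class Cooldown:
    """Per-target breaker: closed, then open (cooling down), then a half-open probe."""

    def __init__(self, threshold=3, base_cooldown=5.0, max_cooldown=300.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.base_cooldown = base_cooldown
        self.max_cooldown = max_cooldown
        self.clock = clock
        self.failures = 0        # consecutive failures while closed
        self.trips = 0           # consecutive trips; lengthens the next cooldown
        self.open_until = None   # None means closed (healthy)
        self.probing = False

    def allow(self):
        if self.open_until is None:
            return True
        if self.clock() < self.open_until or self.probing:
            return False
        self.probing = True      # cooldown elapsed: admit a single probe
        return True

    def record_success(self):
        self.failures = self.trips = 0
        self.open_until = None
        self.probing = False

    def record_failure(self, retry_after=None):
        was_probe, self.probing = self.probing, False
        self.failures += 1
        if was_probe or retry_after is not None or self.failures >= self.threshold:
            wait = retry_after if retry_after is not None else min(
                self.max_cooldown, self.base_cooldown * 2 ** self.trips)
            self.trips += 1
            self.failures = 0
            self.open_until = self.clock() + wait
```

The balancer would call `allow()` when building its pool and report each outcome back. A failed probe doubles the wait, which is what keeps a still-broken backend from flapping back into rotation.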

Weighted selection with failover, in pseudocode

Putting it together: weighted random choice over the healthy targets, retry-with-backoff on transient errors, skip-and-cooldown on a dead target, fail-fast on client errors.

targets = [
  { id: "openai-key-A",  weight: 20, healthy: true },   # high tier, most traffic
  { id: "openai-key-B",  weight: 10, healthy: true },   # second key, same provider
  { id: "anthropic-key", weight:  8, healthy: true },   # different provider: no correlated 429s
  { id: "groq-key",      weight:  4, healthy: true },   # cheap/fast overflow
]

RETRY_ON = {429, 500, 502, 503, 529, timeout}   # transient: retry, then move on
SKIP_ON  = {401, 403}                           # bad or revoked key: next target
FAIL_ON  = {400, 413, 422}                      # client error - never fall through

def pick_weighted(pool):
    r = random() * sum(t.weight for t in pool)
    for t in pool:
        r -= t.weight
        if r <= 0: return t
    return pool[-1]                             # guard against float rounding

def route(request):
    tried = set()
    while True:
        pool = [t for t in targets if t.healthy and t.id not in tried]
        if not pool: raise AllTargetsExhausted
        target = pick_weighted(pool)

        for attempt in range(MAX_RETRIES + 1):
            resp = call(target, request)
            if resp.ok:                  return resp             # done
            if resp.status in FAIL_ON:   raise ClientError(resp) # client bug - stop
            if resp.status in SKIP_ON:
                target.healthy = False                           # unusable until the key is fixed
                break                                            # next target
            if resp.status == 429:                               # this target is hot
                # cooldown() marks the target unhealthy and restores it when the window ends
                cooldown(target, resp.header("Retry-After") or 10)
                break                                            # leave the retry loop
            if resp.status in RETRY_ON and attempt < MAX_RETRIES:
                sleep(backoff(attempt) + jitter())               # 5xx/timeout: try again
                continue
            break                                                # retries spent or unknown error: next target
        tried.add(target.id)                                     # advance to a fresh target

None of this is exotic — it's the same weighted-balancing, circuit-breaking, failover machinery every mature distributed system grows. That's the problem: it's a lot of load-bearing plumbing to build, test, and maintain in every service that calls a model, and to re-verify on every provider or pricing change.

Let a gateway do the load balancing for you

Weighted key selection, least-in-flight routing, health checks and cooldowns, 429/5xx failover with backoff, cost-aware ranking — these are infrastructure, not your product. The natural home for them is one layer down: an LLM gateway that sits between your app and the providers, owns the distribution and failover policy as configuration, and exposes one stable endpoint. Your code makes a single call; the gateway decides which key on which provider serves it. For the bigger picture see what is an LLM gateway, and for failover-and-racing mechanics in depth, LLM fallback and racing.

flo2 is a developer-first LLM gateway built for exactly this. Bring your own provider keys (OpenAI, Anthropic, Gemini, Groq, Cerebras, DeepInfra, Mistral, xAI, OpenRouter) and route every request through one OpenAI- and Anthropic-compatible key. Register multiple keys per provider, distribute load across keys and providers to multiply your effective rate limit, and define fallback chains that automatically reroute to another key or provider on a 429 or 5xx — all with zero token markup, since you pay the providers directly. It's load balancing without a balancer to build, and it's free during Beta.
