2026-06-03 · flo2 blog

Streaming + Fallback: Reliable LLM Streams Across Providers

Streaming is now the default expectation for any LLM-powered interface: tokens paint the screen as they arrive, perceived latency drops, and users stay engaged. But LLM streaming fallback is where things get genuinely hard. Once you have flushed even a single byte of a server-sent event stream to a client, you cannot un-send it. A naive fallback that kicks in mid-stream silently truncates the response, leaves the client in an unknown state, and is nearly impossible to recover from without reloading. Getting this right requires thinking carefully about when you can still fall back, and engineering so that a failing provider never touches the client stream at all.

Why streaming breaks naive fallback strategies

With a normal, non-streaming LLM call the failure model is clean. The provider returns an error HTTP status, you catch it, you retry or route to the next provider, and the client sees nothing until you have a complete response. The atomicity is free: the HTTP response is either a success or a failure.

Streaming destroys that atomicity. The flow looks like this:

1. Your server sends HTTP 200 and opens a text/event-stream response.
2. Tokens begin arriving from the provider and are forwarded to the client as SSE data: frames.
3. Somewhere in the middle (at token 47, say) the provider silently stalls, times out, or closes the connection.
4. Your server has already committed to the 200. The stream is open. The client has partial text on screen.

At this point you have no clean way out. You can close the stream with a [DONE] frame and the client sees a truncated response. You can send an error event, but most client SDKs do not handle mid-stream errors gracefully. You cannot issue a redirect. The HTTP response code was already sent. Fallback — trying the next provider — would mean starting a new request and sewing a new response onto the already-partially-rendered output, which is practically impossible to do transparently.

This is the core challenge: you can only fall back before the client stream is opened. Once you have committed, you are committed.

The right architecture: fail fast before forwarding

The correct approach is to never forward a single byte to the client until you have established that the upstream is healthy and generating. That means absorbing the initial risk inside a gateway layer rather than passing it directly to the client connection.

Strategy 1: Buffer until first token, then forward

Hold the upstream stream open on the gateway side and buffer until the first real content token arrives. Only once you have confirmed that the provider accepted the request, did not return a 5xx, and emitted at least one content token do you open the downstream stream and begin forwarding. If the upstream fails before that point, fall back to the next provider. The client never sees a stream from a provider that failed before its first token. The cost is latency: the gateway holds the response until the upstream's first token arrives, typically 500 ms to a few seconds, and every failed attempt adds its own wait to the client's TTFT. That is a reasonable price for keeping failed providers off the client stream.
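The buffering step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production gateway: the Provider interface (a callable that returns an async generator of text chunks), the AllProvidersFailed exception, and the 5-second default timeout are all hypothetical names and values chosen for the example.

```python
import asyncio
from typing import AsyncGenerator, Callable, Sequence

# Hypothetical provider interface: calling it with a prompt returns an
# async generator of text chunks. Real provider SDKs will look different.
Provider = Callable[[str], AsyncGenerator[str, None]]


class AllProvidersFailed(Exception):
    """Raised when no provider produced a first chunk."""


async def stream_with_fallback(
    prompt: str,
    providers: Sequence[Provider],
    first_token_timeout: float = 5.0,  # assumed default, tune per provider
) -> AsyncGenerator[str, None]:
    """Yield chunks from the first provider that emits a chunk in time."""
    for provider in providers:
        upstream = provider(prompt)
        try:
            # Nothing has been yielded downstream yet, so a failure here is
            # invisible to the client and falling back is still safe.
            first = await asyncio.wait_for(upstream.__anext__(), first_token_timeout)
        except Exception:
            # Covers upstream errors, timeouts, and streams that end empty.
            await upstream.aclose()
            continue
        # Point of no return: from here on the client has bytes.
        yield first
        async for chunk in upstream:
            yield chunk
        return
    raise AllProvidersFailed("no provider produced a first token")
```

Note that this sketch falls back on every pre-token failure, and that an error raised after the first chunk propagates to the caller, which then has to signal the incomplete response rather than switch providers.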

Strategy 2: Fail fast on connection and pre-generation errors

Many failures happen before any generation occurs — a 429 rate limit, a 503 from an overloaded backend, a timeout waiting for the connection to establish, a 400 bad request. These are the easiest case: the upstream never started generating, so no buffering is even needed. Your gateway should detect these synchronously, fall back immediately, and the client never sees the failed attempt at all.

Error classification matters: transient errors (429, 500–529, timeouts, connection resets) are worth routing to the next provider; permanent errors (400, 401, 403, content refusals) should fail fast rather than burning fallback slots on requests that will fail everywhere.
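The classification rule above fits in a small lookup function. The sketch below is illustrative only: the Action enum, the classify function, and its parameters are hypothetical names, and treating unlisted statuses as fail-fast is an assumption rather than something the rule specifies.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    FALLBACK = "fallback"    # transient: try the next provider
    FAIL_FAST = "fail_fast"  # permanent: return the error to the caller


PERMANENT_STATUSES = {400, 401, 403}


def classify(
    status: Optional[int],
    network_error: bool = False,  # timeout or connection reset, no status
    refusal: bool = False,        # provider declined on content grounds
) -> Action:
    """Decide whether a failed attempt is worth routing elsewhere."""
    if refusal:
        return Action.FAIL_FAST
    if network_error or status is None:
        return Action.FALLBACK
    if status == 429 or 500 <= status <= 529:
        return Action.FALLBACK
    if status in PERMANENT_STATUSES:
        return Action.FAIL_FAST
    # Assumption: statuses the rule does not mention are not retried.
    return Action.FAIL_FAST
```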

Strategy 3: Racing for latency and resilience together

A third approach combines streaming fallback with racing: issue the request to two providers simultaneously, buffer both responses until the first token arrives from either, then forward the winner and cancel the loser. The client gets a single stream from whichever provider produced a first token soonest. The losing request is cancelled, and whatever it cost before cancellation is the hedging tax you pay for the latency guarantee.

Racing is especially useful when a provider has high-variance TTFT — usually fast but occasionally stalling for seconds before generation starts. A consistent secondary flattens that variance at the p95 and p99 without giving up the primary's typical quality.
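Racing can be sketched with asyncio tasks, one per provider, each waiting for that provider's first chunk. As before, this is a sketch under assumptions: the Provider interface, the race_streams name, and the timeout default are hypothetical, and a real gateway would also need cleanup if the client disconnects while the race is still running.

```python
import asyncio
from typing import AsyncGenerator, Callable, Sequence

# Hypothetical provider interface, as an async generator of text chunks.
Provider = Callable[[str], AsyncGenerator[str, None]]


async def _first_chunk(upstream: AsyncGenerator[str, None]) -> str:
    return await upstream.__anext__()


async def race_streams(
    prompt: str,
    providers: Sequence[Provider],
    first_token_timeout: float = 5.0,  # assumed default
) -> AsyncGenerator[str, None]:
    """Forward the provider whose first chunk arrives soonest."""
    upstreams = [provider(prompt) for provider in providers]
    tasks = {asyncio.create_task(_first_chunk(u)): u for u in upstreams}
    pending = set(tasks)
    winner, first = None, None
    loop = asyncio.get_running_loop()
    deadline = loop.time() + first_token_timeout
    while pending and winner is None:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        done, pending = await asyncio.wait(
            pending, timeout=remaining, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            # A task that raised is a failed provider; keep waiting on the rest.
            if task.exception() is None and winner is None:
                winner, first = tasks[task], task.result()
    # Cancel the losers and close their streams before forwarding anything.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    for upstream in upstreams:
        if upstream is not winner:
            await upstream.aclose()
    if winner is None:
        raise RuntimeError("no provider produced a first token")
    yield first
    async for chunk in winner:
        yield chunk
```

Cancelling a task here only stops the local wait. Whether the provider stops generating, and stops billing, depends on the provider and on closing the underlying HTTP connection.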

SSE streaming: a quick primer

Server-sent events (SSE) is the standard transport for streamed LLM responses. The client opens one persistent HTTP connection; the server pushes data: frames over it as tokens are generated. Each frame is terminated by a blank line, that is, a double newline. In the OpenAI-compatible format the stream ends with a data: [DONE] sentinel, which is a convention of that API rather than part of SSE itself.

HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive

data: {"id":"chatcmpl-abc","choices":[{"delta":{"role":"assistant"},"index":0}]}

data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":"The"},"index":0}]}

data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":" answer"},"index":0}]}

data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":" is"},"index":0}]}

data: [DONE]

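OpenAI and Anthropic both stream over SSE, but their payloads differ: Anthropic's Messages API uses named event: types with its own delta structure and ends the stream with a message_stop event rather than data: [DONE]. A gateway normalizes both into one format. The key constraint visible in this wire format: once that first data: frame hits the TCP socket, the client has bytes. There is no recall.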
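To make the wire format concrete, here is a small parser for the OpenAI-style frames shown above. It is a sketch that handles only data: fields and ignores event:, id:, retry:, and comment lines; the function names are made up for this example.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete SSE event."""
    buffered = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the event and dispatches it.
            if buffered:
                yield "\n".join(buffered)
                buffered = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # One leading space after the colon is not part of the value.
            buffered.append(value[1:] if value.startswith(" ") else value)
    # An event with no terminating blank line is incomplete and is dropped.


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate the content deltas of an OpenAI-style stream."""
    parts = []
    for payload in iter_sse_data(lines):
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        parts.append(delta.get("content", ""))
    return "".join(parts)
```

Fed the four data: frames from the example above, collect_text returns "The answer is". The role-only first frame contributes an empty string.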

Surfacing errors cleanly vs. silent truncation

When a stream must terminate early — because all fallback providers failed, because you hit a hard timeout, because the context window was exceeded mid-response — you have two options: silent truncation, or an explicit error signal. Silent truncation is always wrong. The client cannot distinguish "the model finished" from "something went wrong," leading to silent data quality issues that are much harder to debug than an explicit failure.

The right behavior is to emit a structured error event before closing the stream. The OpenAI streaming format has no formal error event type, but the convention is to emit a final data frame with an error field rather than a choices field, followed by [DONE]. Client code that checks for this can display an appropriate message or trigger a retry at the application layer.

data: {"error":{"message":"All upstream providers failed","type":"gateway_error","code":"fallback_exhausted"}}

data: [DONE]

If the buffer-before-forward strategy has already committed some content to the client, emit the error event anyway, let the client know the response is incomplete, and include how many tokens were delivered so the application can decide whether to retry or surface the partial result.
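Both sides of this convention can be sketched briefly: a gateway-side helper that serializes the error event, and a client-side check that separates a clean finish from an error or a silent cut-off. The function names and the delivered_tokens field are hypothetical; the error convention is the one described above, not a formal part of any provider's API.

```python
import json
from typing import Iterable, Optional


def error_frames(message: str, code: str, delivered_tokens: Optional[int] = None) -> str:
    """Serialize a terminal error event followed by the [DONE] sentinel."""
    error = {"message": message, "type": "gateway_error", "code": code}
    if delivered_tokens is not None:
        # Hypothetical field name; lets the client judge the partial result.
        error["delivered_tokens"] = delivered_tokens
    payload = json.dumps({"error": error}, separators=(",", ":"))
    return "data: " + payload + "\n\n" + "data: [DONE]\n\n"


def stream_outcome(payloads: Iterable[str]) -> str:
    """Classify a finished stream from its data payloads (client side)."""
    for payload in payloads:
        if payload == "[DONE]":
            return "complete"
        if "error" in json.loads(payload):
            return "error"
    # The connection closed without [DONE] or an error: treat as truncated.
    return "truncated"
```

A client that treats a missing [DONE] as a truncation catches even the case where the gateway itself died before it could emit an error event.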

Idempotency and streaming retries

Streaming requests are not idempotent: a retry samples a new response, which will generally differ from the first. That is usually fine and even desirable, because the fallback provider generates a fresh, complete answer rather than splicing onto a partial one. What matters is that the gateway keeps the client stream closed until a healthy upstream is confirmed (buffer-before-forward), or clearly signals incompleteness if content was already forwarded. For cache-heavy workloads where exact reproducibility matters, caching the full non-streaming response and replaying it as a synthetic SSE stream avoids the truncation problem entirely. For more on squeezing TTFT down, see "Reduce LLM latency".
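Replaying a cached response as a synthetic stream is straightforward to sketch. The generator below emits OpenAI-style frames for a complete cached text; the function name, the chatcmpl-cached id, and the fixed character-based chunk size are assumptions made for illustration.

```python
import json
from typing import Iterator


def replay_as_sse(
    text: str,
    completion_id: str = "chatcmpl-cached",  # hypothetical id
    chunk_size: int = 16,                    # characters per frame, arbitrary
) -> Iterator[str]:
    """Yield SSE frames that stream a cached, complete response."""

    def frame(delta: dict) -> str:
        chunk = {"id": completion_id, "choices": [{"delta": delta, "index": 0}]}
        return "data: " + json.dumps(chunk, separators=(",", ":")) + "\n\n"

    yield frame({"role": "assistant"})
    for start in range(0, len(text), chunk_size):
        yield frame({"content": text[start:start + chunk_size]})
    yield "data: [DONE]\n\n"
```

Because the full text exists before the first frame is written, this stream cannot be truncated by an upstream failure.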

How a gateway implements this for you

Building all of this correctly in application code is tedious: connection management across provider SDKs, error classification, buffering logic, SSE re-serialization, racing with cancellation, and structured error events on exhaustion. It is the same across every LLM-powered feature in your stack, so it belongs one layer below your application — in a gateway.

An LLM gateway absorbs all of this as a primitive. Configure a fallback chain and optionally enable racing; the gateway buffers until first token, picks the winning upstream, and forwards one clean stream to your client. Your application code calls a single OpenAI-compatible endpoint with stream: true and gets back a resilient SSE stream. Provider failures and slow tails become operational concerns for the gateway, not your app.

flo2 implements streaming fallback and racing natively — buffer-before-forward, configurable fallback chains, provider racing, and clean error events on exhaustion — with zero token markup and full support for your own provider API keys. You bring the keys; flo2 handles the stream reliability. It is free during beta.
