2026-06-03 · flo2 blog

Handling LLM Timeouts: Sensible Limits, Streaming & Fallback

An LLM timeout is one of the most mishandled failure modes in production AI applications. Set the timeout too short and you kill perfectly healthy long generations mid-stream; set it too long — or skip it entirely — and a hung connection blocks your worker forever. Getting this right requires understanding why LLM requests are inherently slow and variable, how to structure client timeouts so they don't cause spurious failures, and how streaming and fallback turn a fragile single-shot call into a resilient one. This guide covers all of it.

Why LLM requests are slow and unpredictable

LLM requests do not behave like typical REST or database calls. Latency is dominated by three variables, each of which can spike independently of the others:

Output length. A model generating a 50-token reply finishes in roughly 0.5–1 s on a fast provider. A 4,000-token structured JSON output at the same tokens-per-second rate takes 40–80 s. Your timeout must cover the expected output length, not just the round-trip.
Context window size. Prefilling a large prompt (thousands of tokens of system instructions, retrieval hits, or conversation history) adds significant time before the first token is even generated. Time-to-first-token (TTFT) can range from roughly 300 ms on a near-empty context to several seconds on a full 128 k window.
Provider load. Shared inference clusters exhibit multi-second tail latency spikes when a model is under heavy demand. The same request that returns in 2 s at 2 AM may take 12 s during a post-announcement traffic surge on the same model.

This variance is not a bug — it's the nature of autoregressive generation. Any timeout policy that ignores it will produce spurious failures on legitimate requests.

Why naive short timeouts cause spurious failures

A common mistake is copying an HTTP timeout from a synchronous API (5 s, 10 s) and applying it to an LLM call. This works fine for short completions in a quiet environment and silently kills requests in every other case. The failure mode looks like flakiness: the same prompt succeeds 80% of the time and times out the other 20%, with no clear pattern — because the 20% happened to generate longer output, or hit a busier backend node.

The opposite mistake, no timeout at all, is worse. A hung connection holds an open socket and, in a threaded server, a blocked worker thread. In a Node.js or Python server handling concurrent requests, a handful of hung LLM calls can exhaust the thread pool, the connection pool, or your concurrency limit and make the entire service unresponsive, not just the affected requests.

Setting sensible client timeouts

The right mental model splits the timeout into two independent values:

Connect timeout. How long to wait to establish the TCP/TLS connection to the provider's API. This should be short — 5–10 s is generous. A TCP handshake that takes longer than that is a network problem, not an LLM latency problem.
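Read timeout (which often doubles as the total timeout for non-streaming calls). How long to wait for the next chunk of data on an established connection. On a non-streaming call the provider typically sends nothing until the whole response is ready, so this value has to cover the full generation; on a streaming call it is the maximum idle gap between tokens. This must be calibrated to your expected output length, not to your patience.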

A practical calibration heuristic: take your max expected output token count, divide by the provider's typical tokens-per-second (often 40–100 tps for frontier models, faster on Groq/Cerebras), add your max expected TTFT, then add a 50% buffer for variance. For a call that might return 2,000 tokens at 60 tps with a 3 s TTFT, that's roughly (33 s + 3 s) × 1.5 ≈ 54 s. Round up to 60 s. For a short summarization call capped at 300 tokens, 15–20 s is plenty.
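The heuristic can be written down as a small helper. This is a minimal sketch: the name estimate_timeout and its parameters are illustrative rather than part of any library, and the 50% buffer is simply the default suggested above.

```python
def estimate_timeout(max_output_tokens: int,
                     tokens_per_second: float,
                     max_ttft_s: float,
                     buffer: float = 0.5) -> float:
    """Estimate a read/total timeout in seconds for a non-streaming call.

    generation time = tokens / throughput, plus worst-case TTFT,
    then scaled by (1 + buffer) to absorb variance.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    generation_s = max_output_tokens / tokens_per_second
    return (generation_s + max_ttft_s) * (1.0 + buffer)


# The worked example from the text: 2,000 tokens at 60 tps with a 3 s TTFT.
print(round(estimate_timeout(2000, 60, 3.0), 1))  # 54.5
```

Feed it measured throughput and TTFT from your own traffic where you can; the published tokens-per-second figures for a provider are only a starting point.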

If you use different models for different tasks, set per-model timeouts. A GPT-4o mini call for a quick classification deserves a tighter ceiling than a Claude Opus call doing multi-step reasoning over a long document.
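One way to keep per-model limits in a single place is a small lookup table. The sketch below is an assumption about how you might organize it: TimeoutPolicy, POLICIES, policy_for, the placeholder model key, and every number are hypothetical and should be replaced with values from your own measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutPolicy:
    connect_s: float  # TCP/TLS handshake budget
    read_s: float     # max wait for the next chunk of data


# Hypothetical values: a conservative default plus per-model overrides.
DEFAULT_POLICY = TimeoutPolicy(connect_s=8.0, read_s=45.0)
POLICIES = {
    "gpt-4o-mini": TimeoutPolicy(connect_s=5.0, read_s=15.0),
    # Placeholder key: substitute the real identifier of your long-reasoning model.
    "your-long-reasoning-model": TimeoutPolicy(connect_s=8.0, read_s=120.0),
}


def policy_for(model: str) -> TimeoutPolicy:
    """Return the timeout policy for a model, or the default if none is set."""
    return POLICIES.get(model, DEFAULT_POLICY)
```

With httpx, a policy could then be turned into a client setting with something like httpx.Timeout(policy.read_s, connect=policy.connect_s), where the first argument acts as the default for the remaining phases.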

Using streaming to avoid full-response timeouts

Streaming is the single most effective tool for managing LLM timeout anxiety, for a simple reason: you start receiving output before the generation is complete. Instead of waiting 45 s for a full response and wondering whether the connection is alive, you see the first token in 1–2 s and continue receiving tokens in a steady stream.

With streaming, the timeout that matters is not "how long until the full response arrives" but "how long can I tolerate silence between tokens." A reasonable inter-token idle timeout is 10–15 s — if no token has arrived in that window, something is wrong. This is far easier to calibrate than a total-response timeout and far more sensitive to actual hangs versus legitimately slow generation.
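An inter-token idle timeout can be sketched without committing to any particular SDK. The wrapper below assumes your client exposes the token stream as an async iterator; with_idle_timeout is a hypothetical helper name, and for simplicity it applies the same window to the first token as to every later one.

```python
import asyncio
from typing import AsyncIterator


async def with_idle_timeout(stream: AsyncIterator[str],
                            idle_timeout_s: float) -> AsyncIterator[str]:
    """Yield items from `stream`; raise asyncio.TimeoutError if the wait
    for the next item exceeds `idle_timeout_s`."""
    iterator = stream.__aiter__()
    while True:
        try:
            # The clock restarts for every token, so a long but steadily
            # flowing generation never trips the timeout.
            item = await asyncio.wait_for(iterator.__anext__(), idle_timeout_s)
        except StopAsyncIteration:
            return  # the stream finished normally
        yield item
```

Because the timer resets on each token, a 90-second generation passes untouched while a stream that goes quiet for longer than the window fails fast.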

Streaming also improves perceived latency for your users: they see words appearing rather than a blank loading state for 30 s. It is almost always the right choice for user-facing completions.

Distinguishing a true hang from a slow-but-working call

Not every slow response is a hang. The critical signal is token flow:

Tokens arriving, just slowly: the model is generating; the provider is under load. Do not time out mid-stream. If you kill this call you will retry and likely hit the same congestion.
No tokens yet, shortly after the connection opened: TTFT is elevated, probably due to long prefill or queue delay. Give it a generous TTFT window (up to 20–30 s for large contexts) before treating it as a hang.
Connection established, then silence past the TTFT window, or tokens that stop partway through and do not resume: this is a genuine hang. The server accepted your request but something went wrong before generation began or partway through. This is worth timing out and retrying or failing over.

If you are not streaming, you cannot distinguish these cases at all — every slow response looks identical until your total timeout fires. This is another argument for defaulting to streaming wherever possible.
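The three cases reduce to two clocks: time since the connection opened and time since the last token. A minimal sketch of that decision follows, with classify_stream and its labels being hypothetical names and the default windows taken from the ranges mentioned in this guide.

```python
from typing import Optional


def classify_stream(now_s: float,
                    opened_at_s: float,
                    last_token_at_s: Optional[float],
                    ttft_window_s: float = 30.0,
                    idle_window_s: float = 15.0) -> str:
    """Classify a streaming call as 'prefill', 'generating', or 'hung'.

    All times are seconds on the same monotonic clock;
    `last_token_at_s` is None until the first token arrives.
    """
    if last_token_at_s is None:
        # No token yet: tolerate silence only inside the TTFT window.
        waited = now_s - opened_at_s
        return "prefill" if waited <= ttft_window_s else "hung"
    # Tokens have flowed before: judge by the gap since the last one.
    gap = now_s - last_token_at_s
    return "generating" if gap <= idle_window_s else "hung"
```

Only the "hung" result should trigger a timeout; the other two states mean the call is slow but still making progress.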

Combining timeouts with retries and fallback

A timeout that fires without a next step is just a failure. The full pattern combines three layers:

Timeout — cap how long you wait for any single attempt.
Retry with backoff — on a timeout from a transient cause (provider load spike), retry the same model with a short delay. Keep the retry count low (1–2) to avoid compounding latency. See LLM retries & backoff for the full classification and backoff implementation.
Fallback to another provider — if the primary model is consistently timing out (provider incident, sustained load), move the request to a secondary. A timeout that keeps repeating after retries is a signal to leave the provider, not keep hammering it. See LLM fallback and racing for how to build and order fallback chains.
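The three layers compose into one loop. The sketch below is deliberately provider-agnostic: each provider is a plain callable that is assumed to raise the built-in TimeoutError when its own timeout fires (with httpx you would catch httpx.TimeoutException instead), and call_with_retry_and_fallback is a hypothetical name.

```python
import time
from typing import Callable, Optional, Sequence


def call_with_retry_and_fallback(
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    retries_per_provider: int = 1,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try providers in order; retry each timeout with exponential backoff,
    then fall back to the next provider once retries are exhausted."""
    last_error: Optional[Exception] = None
    for call in providers:
        for attempt in range(retries_per_provider + 1):
            try:
                return call(prompt)  # layer 1: the call enforces its own timeout
            except TimeoutError as exc:
                last_error = exc
                if attempt < retries_per_provider:
                    # Layer 2: short, growing delay before retrying the same provider.
                    sleep(base_delay_s * (2 ** attempt))
        # Layer 3: retries exhausted here, so move on to the next provider.
    raise RuntimeError("all providers timed out") from last_error
```

The sleep parameter is injectable only so the function can be tested without real delays; in production the default time.sleep is fine for synchronous code.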

Racing — sending the request to two providers simultaneously and taking the first to respond — is the nuclear option for latency-sensitive paths. It works, but it doubles provider cost, so reserve it for cases where tail latency directly impacts revenue or user experience.
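Racing can be sketched with asyncio alone. In this example each provider call is assumed to be a zero-argument coroutine function that you supply; race is a hypothetical helper that returns the first successful result and cancels whatever is still in flight.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence


async def race(calls: Sequence[Callable[[], Awaitable[str]]],
               total_timeout_s: float) -> str:
    """Start every call at once and return the first successful result."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + total_timeout_s
    tasks = {asyncio.ensure_future(call()) for call in calls}
    last_error: Optional[BaseException] = None
    try:
        while tasks:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            done, tasks = await asyncio.wait(
                tasks, timeout=remaining, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                error = task.exception()
                if error is None:
                    return task.result()  # first success wins
                last_error = error        # a failed racer; keep waiting on the rest
        raise RuntimeError("all raced calls failed or timed out") from last_error
    finally:
        for task in tasks:
            task.cancel()  # stop waiting on the losers
```

A racer that fails quickly does not end the race; the function keeps waiting for the remaining calls until one succeeds or the overall deadline passes.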

Timeout + fallback in code

Here is a minimal but production-realistic Python example using httpx (which exposes separate connect and read timeouts) with a fallback chain:

import httpx

PROVIDERS = [
    {
        "url": "https://api.openai.com/v1/chat/completions",
        "headers": {"Authorization": "Bearer sk-YOUR-OPENAI-KEY"},
        "model": "gpt-4o-mini",
    },
    {
        "url": "https://api.anthropic.com/v1/messages",
        "headers": {
            "x-api-key": "sk-ant-YOUR-ANTHROPIC-KEY",
            "anthropic-version": "2023-06-01",
        },
        "model": "claude-haiku-4-5",
    },
]

TIMEOUT = httpx.Timeout(
    connect=8.0,   # TCP/TLS handshake — short, network-level
    read=45.0,     # Time between data chunks (or full response if not streaming)
    write=10.0,
    pool=5.0,
)


def build_payload(provider: dict, user_message: str) -> dict:
    if "anthropic" in provider["url"]:
        return {
            "model": provider["model"],
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": user_message}],
        }
    return {
        "model": provider["model"],
        "messages": [{"role": "user", "content": user_message}],
    }


def chat_with_fallback(user_message: str) -> str:
    last_error = None
    for provider in PROVIDERS:
        try:
            with httpx.Client(timeout=TIMEOUT) as client:
                resp = client.post(
                    provider["url"],
                    headers=provider["headers"],
                    json=build_payload(provider, user_message),
                )
            resp.raise_for_status()
            data = resp.json()
            # Normalize across provider response shapes
            if "anthropic" in provider["url"]:
                return data["content"][0]["text"]
            return data["choices"][0]["message"]["content"]

        except httpx.TimeoutException as exc:
            print(f"[fallback] {provider['model']} timed out: {exc}. Trying next.")
            last_error = exc
            continue
        except httpx.HTTPStatusError as exc:
            if exc.response.status_code in {429, 500, 502, 503, 504}:  # transient, incl. gateway timeout
                print(f"[fallback] {provider['model']} returned {exc.response.status_code}. Trying next.")
                last_error = exc
                continue
            raise  # 400/401/422: not retryable, surface immediately

    raise RuntimeError(f"All providers failed. Last error: {last_error}") from last_error


if __name__ == "__main__":
    reply = chat_with_fallback("Summarize the key ideas in zero-knowledge proofs in two sentences.")
    print(reply)

The key points in this implementation: the connect and read timeouts are split at the httpx.Timeout level; TimeoutException and transient HTTP errors both trigger a fallback; permanent errors (400/401/422) are re-raised immediately, because they usually point to a bug in the request or the credentials that a fallback would only hide; and the fallback list is just a Python list you can extend, reorder, or load from config.

Centralizing timeout and fallback policy in a gateway

The code above works, but it has a problem: every service that calls an LLM has to re-implement this logic. When you need to tighten timeouts because a provider is having a slow week, or add a third fallback model, you are updating N codebases instead of one config file. This is precisely what an LLM gateway is for.

A gateway sits between your application and your providers. You define timeout thresholds and fallback chains once, in the gateway config. Your application code makes a single OpenAI-compatible call to the gateway and gets back a response — the gateway handles the connect timeout, read timeout, retry, fallback, and provider-key rotation transparently. When provider behavior changes, you update the gateway config, not your application.

flo2 is a developer-first LLM gateway that handles timeout policy, fallback chains, racing, and streaming out of the box — with zero token markup and support for your own provider keys. You write one httpx (or fetch) call; flo2 handles the rest. Free during Beta.
