2026-06-03 · flo2 blog

How to Reduce LLM Latency: TTFT, Streaming, Racing & Caching

A model that answers in 400 milliseconds feels like a tool; one that takes eight seconds feels broken. If you ship anything interactive on top of large language models, the work to reduce LLM latency is some of the highest-leverage engineering you can do — it changes how your product feels far more than another point of benchmark accuracy ever will. The good news is that latency is not one number you're stuck with. It's a stack of separate components, each with its own lever, and most of them are under your control without touching the model weights at all.

This guide breaks down where the milliseconds actually go, then walks through the concrete levers — streaming, model choice, output length, caching, and racing — plus how to measure so you're optimizing reality instead of vibes.

The anatomy of LLM latency

"Slow" is too coarse to optimize. The single most useful split is between when the first token arrives and when the last one does, because they have completely different causes and fixes.

Time to first token (TTFT)

Time to first token — TTFT — is how long the user waits before anything appears. It's what people actually perceive as "responsiveness," and it's the sum of three things you should think about separately:

Network round-trip. The bytes have to travel to the provider and back. Usually small, but it grows with geographic distance and balloons if you're opening a fresh TLS connection on every call instead of reusing a pooled one.
Provider queue time. Before your request is processed, it may sit in the provider's queue waiting for capacity. This is the invisible, highly variable component — it's near zero when the provider is idle and seconds long when a popular model is saturated. You don't control it, but you can route around it.
Prefill. The model reads and encodes your entire prompt before it generates a single output token. Prefill scales with input size, so a 30k-token RAG context costs real TTFT every call — a long prompt is slow to start, not just slow to finish.

Total completion time and tokens per second

Once generation starts, total time is dominated by how many tokens you ask for and how fast the model emits them. The relationship is almost arithmetic:

total_latency ≈ TTFT + (output_tokens ÷ tokens_per_second)

TTFT          = network + queue_time + prefill
generation    = output_tokens ÷ tokens_per_second   # the "stream" phase

Tokens per second (the inter-token rate, sometimes called TPS or output speed) is a property of the model and, crucially, the inference stack serving it. The same open-weight model can run several times faster on a specialized inference host than on a general-purpose one. So two of your three biggest dials — output length and tokens per second — live entirely on the generation side, and the third, prefill, lives on the input side. Optimize all three and you've covered most of the curve.

Lever 1: Stream tokens to crush perceived latency

The cheapest, highest-impact change for most apps isn't making the model faster — it's not making the user wait for the whole answer. With streaming (server-sent events / stream: true), tokens render as they're generated, so the user's wait collapses to TTFT instead of total completion time. A 6-second response that starts painting text at 500 ms feels dramatically faster than the same 6 seconds delivered as one silent block at the end.

Streaming doesn't reduce total latency — the last token still arrives at the same moment — but for anything a human reads in real time, perceived latency is the metric that matters, and TTFT is what governs it. This is also why LLM streaming latency work tends to focus obsessively on shrinking TTFT: in a streaming UI, TTFT is the responsiveness number. The only places to skip streaming are non-interactive ones — batch jobs, or calls whose full output you must parse before doing anything (strict JSON you can't act on partially).

Lever 2: Pick faster models and inference providers

Not every request needs your smartest model, and the speed gap between options is enormous. Two independent choices govern raw speed:

Model size. Smaller models generate more tokens per second and usually prefill faster. For latency-critical paths — autocomplete, routing, classification, short tool-calls — a smaller "mini"/"flash"-class model is often both fast enough and good enough.
Inference provider. The host serving a model matters as much as the model itself. Specialized inference providers like Groq and Cerebras are built for high throughput and are typically much faster — often dramatically so — than general-purpose endpoints running comparable open-weight models. For a faster LLM response on latency-sensitive work, the same model on a faster host can be the entire fix.

Treat all relative speed claims (this guide's included) as a starting hypothesis, not gospel: benchmark from your own region with your own prompts. Tokens-per-second and TTFT vary by prompt shape, output length, time of day, and current load, and the only numbers that matter are the ones you measure against your real traffic.

Lever 3: Generate fewer tokens

Because total time scales directly with output length, the fastest token is the one you never generate. Two habits pay off immediately:

Cap max_tokens. An unbounded limit lets a chatty model run long for no benefit. Set the ceiling to what the task genuinely needs — a classifier returning a label doesn't need room for 800 tokens.
Ask for brevity. Instruct the model to be concise, return structured fields instead of prose, and skip the preamble ("Here's a summary of…"). Fewer output tokens is fewer seconds, on every single call.

On the input side, trimming context cuts prefill and therefore TTFT: retrieve fewer, better chunks for RAG and summarize long histories instead of resending them verbatim. Shorter in, shorter out — both ends get faster.

Lever 4: Prompt caching to cut prefill

When a large, static chunk of your prompt repeats across calls — a long system prompt, few-shot examples, a fixed document or tool schema — prompt caching (offered natively by several providers) lets the model skip re-encoding it. The cached prefix is processed once and reused, so prefill on subsequent calls drops sharply and TTFT improves for any workload built on shared context. The win grows with how big and how stable that prefix is, which makes RAG and agent loops — where the same instructions ride along on every turn — the prime beneficiaries.

Lever 5: Response caching for instant repeats

Prompt caching speeds up the prefill of a request you still run; response caching skips the model entirely. If an identical request comes in again, you return the stored answer — no network, no queue, no generation. That's not a faster LLM call; it's a near-zero-latency one. Anywhere your traffic repeats — FAQ-style queries, idempotent pipeline steps, popular prompts served to many users — opt-in response caching turns "fast" into "instant" for the hits and removes that load from the provider entirely. The trade-off is staleness, so cache only where a slightly older answer is acceptable, and scope keys to the inputs that actually affect the output.

Lever 6: Racing for tail-latency protection

Even a fast model has a bad p99. A single request can land in a slow provider queue or get stuck behind a degraded node, blowing your latency budget while nothing technically "failed." Retrying after a timeout only adds the timeout to the wait. Racing (hedged requests) attacks this directly: fire the same prompt at two or more models or providers in parallel and serve whichever responds first, aborting the rest.

In a streaming UI, the rule is first-token-wins — the instant any racer emits a token, commit to that stream and cancel the others. Because you're no longer hostage to whichever endpoint happened to be slow this second, racing is the single most effective fix for TTFT variance. The cost is real: a two-way race can nearly double token spend if you let both finish, so give your preferred model a short head start and only launch the hedge for the slow tail, then cancel losers aggressively the moment you have a winner. Reserve racing for latency-critical, user-facing paths where shaving the tail justifies the duplicate-call cost — not for batch work where doubling spend to save a second is a bad trade.

Measure: log TTFT and total on every call

You can't optimize latency you don't record, and an end-to-end stopwatch hides which component is the culprit. Instrument every call to capture, at minimum:

TTFT — request sent to first token received. Your perceived-latency number.
Total latency — request sent to last token. Your throughput number.
Output tokens — so you can derive tokens-per-second (output_tokens ÷ generation_time) and compare models and hosts on equal footing.
Model, provider, and region — so a slow p99 is attributable to a specific target rather than a mystery.

Watch the distribution, not just the average — median TTFT can look fine while the p95 quietly ruins the experience for one user in twenty. With per-call TTFT and total logged, every lever above becomes testable: did switching inference providers actually raise tokens-per-second? Did prompt caching move median TTFT? Did racing tighten the p99? Without the numbers, you're guessing.

Putting it together

Reducing LLM latency is a stack of compounding moves, not a single switch: stream so the wait becomes TTFT, pick a fast-enough model on a fast inference host, generate fewer tokens, cache prefixes and whole responses, race the tail, and measure all of it. The hard parts — response caching, parallel racing with first-token-wins and disciplined cancellation, and true per-call TTFT and cost accounting — are plumbing you'd otherwise rebuild in every service. That's exactly the job of an LLM gateway: it sits between your app and the providers and owns this behavior as configuration behind one stable endpoint.

flo2 is a developer-first, bring-your-own-key LLM gateway built for this work: one OpenAI- and Anthropic-compatible key that can route each request to the fastest model, opt-in response caching, AI racing with a configurable head start and first-token-wins, and true per-call latency and cost accounting — with zero markup on your own provider keys, so you can add Groq, Cerebras, and the rest and benchmark them head to head. For more on the resilience side of parallel calls, see LLM fallback and racing. It's free during Beta, so you can start measuring and shaving your real latency today.

One key, every model — zero markup.

Bring your own provider keys. flo2 routes to the cheapest, fastest model with fallback, racing and true cost accounting — free during Beta.

Get your flo2 key →