2026-06-03 · flo2 blog

Time to First Token (TTFT): What It Is & How to Improve It

The moment a user hits enter, a clock starts — and the only thing that quiets it is the first character appearing on screen. That gap is time to first token, and for anything interactive built on large language models it's the single number that decides whether your product feels alive or frozen. This is a practical look at TTFT: what it is, what drives it, why it dominates perceived responsiveness in chat and agents, the moves that shrink it, and how to measure it honestly.

What is TTFT (time to first token)?

TTFT is the elapsed time from the instant you send a request to the instant the first streamed token comes back. Nothing more. It doesn't care how long the full answer is or how fast tokens flow after the first one. It measures one thing: how long the user stares at a blank space before the response begins.

That makes it easy to confuse with two neighbors it is not:

Total completion time is request-sent to last token. It includes TTFT plus the entire generation phase, so it grows with how many tokens you ask for. A short prompt with a long answer can have a tiny TTFT and a large total.
Inter-token latency (the gap between consecutive tokens, the inverse of tokens-per-second) governs how fast text flows after the stream starts. It's the model's "typing speed" — a completely separate property from how long it took to start typing.

A compact way to hold all three in your head:

total_completion_time = TTFT + (output_tokens × inter_token_latency)

TTFT                  = network_round_trip + provider_queue + prefill
inter_token_latency   = 1 ÷ tokens_per_second

TTFT deserves its own dashboards because it has different causes and fixes from the generation phase. You can run a model with a blazing tokens-per-second rate and still ship a sluggish-feeling app because TTFT is high — they are independent dials.

What drives first token latency

First token latency is a sum of components, and knowing which one is hurting you is half the battle.

Prefill (prompt size)

Before a model emits a single output token, it has to read and encode your entire prompt — system message, history, retrieved context, tool schemas, the lot. This prefill phase scales with input length. A 200-token prompt prefills almost instantly; a 30,000-token RAG context costs real TTFT on every call. The trap: a long prompt isn't just slow to finish, it's slow to start. If your TTFT is bad and your prompts are large, prefill is the prime suspect.

Provider queue and load

Your request may sit in the provider's queue waiting for compute before processing even begins — the most variable part of TTFT. It's near zero when the provider is idle and seconds long when a popular model is saturated at peak. You don't control the queue, but you control whether you stay hostage to it. When TTFT is fine at 3am and terrible at 2pm with unchanged prompts, you're looking at queue time.

Network round-trip

The bytes physically travel to the provider's region and back. Usually the smallest slice, but it grows with distance and balloons if you open a fresh TLS connection on every request instead of reusing a pooled, keep-alive one. Cold connections and cross-continent endpoints quietly tax TTFT here.

Model and hardware

The model and the inference stack serving it set a floor. Larger models generally prefill more slowly, and the same open-weight model can have very different TTFT depending on the hardware and serving software behind it. Specialized inference hosts are engineered to start streaming quickly; general-purpose endpoints may not be. The model name alone doesn't tell you your TTFT — where it runs matters just as much.

Why TTFT matters for UX

Humans read responsiveness almost entirely off the start of a response, not its end. A reply that begins painting text in a few hundred milliseconds feels instant even if it streams for several seconds; the same content delivered as one silent block after an identical total time feels broken. That gap is why TTFT, not total latency, is the UX metric for anything a person watches in real time. The stakes climb in two settings:

Chat. Users tolerate a long, detailed answer happily — as long as it starts quickly. High TTFT reads as the app hanging, and it's the most common reason a capable chatbot feels unpleasant.
Agents and tool chains. Multi-step agents pay TTFT on every hop, and each call is often preceded by a large, growing prompt (full history plus tool schemas), so prefill compounds. TTFT isn't one cost you pay once — it's a tax on every step, frequently the dominant share of wall-clock time.

This is also why a streaming UI lives or dies by TTFT: once you stream, the user's perceived wait collapses to the first token, so TTFT effectively becomes your responsiveness number.

How to improve time to first token

Most of TTFT is under your control without touching model weights. Here are the levers, roughly in order of effort-to-impact.

1. Stream the response

The highest-leverage change for any interactive app isn't making the model faster — it's not making the user wait for the whole answer. With streaming (server-sent events, stream: true), text renders token by token, so the wait shrinks from total completion time down to TTFT. Streaming doesn't lower TTFT itself, but it makes TTFT the only latency the user feels — which is why every other lever here aims at it. Skip it only for non-interactive work: batch jobs, or calls whose full output you must parse before acting.

2. Shrink the prompt

Because prefill scales with input size, fewer input tokens means a faster first token. Retrieve fewer, better chunks for RAG instead of stuffing the context window; summarize long histories rather than resending them verbatim every turn; trim bloated system prompts and tool schemas. Smaller in, faster to start — on every call, for free.

3. Use prompt caching to cut prefill

When a large, static chunk of your prompt repeats across calls — a long system prompt, few-shot examples, a fixed document, a tool schema — prompt caching (offered natively by several providers) lets the model skip re-encoding it. The cached prefix is processed once and reused, so prefill on later calls drops sharply and TTFT improves for any workload built on shared context. RAG pipelines and agent loops are the biggest winners, precisely because their stable prefix is large and repeats constantly.

4. Choose faster inference providers

Not every request needs your largest model. For latency-critical paths — autocomplete, routing, classification, short tool calls — a smaller "mini" or "flash" class model often prefills faster and is good enough. And the host matters as much as the model: specialized inference providers built for high throughput typically start streaming sooner than general-purpose endpoints running comparable models, so the same model on a faster host can be the entire fix. Treat all relative speed claims (this guide's included) as a hypothesis to benchmark from your own region with your own prompts — TTFT shifts with prompt shape, time of day, and load.

5. Use warm, regional endpoints

Cut the network and cold-start slices: reuse pooled keep-alive connections instead of negotiating fresh TLS per request, and prefer an endpoint geographically close to your servers. This won't fix prefill, but it removes a needless, recurring tax on TTFT.

6. Race several providers for the first token

Even a fast endpoint has a bad tail: a single request can land in a slow queue or behind a degraded node, blowing your TTFT budget while nothing technically "fails," and retrying after a timeout only adds the timeout to the wait. Racing (hedged requests) attacks this directly — fire the same prompt at two or more providers in parallel and commit to whichever streams a token first, aborting the rest. In a streaming UI the rule is first-token-wins: the instant any racer emits a token, lock onto that stream and cancel the others. Because you're no longer hostage to whichever endpoint was slow this second, racing is the most effective fix for TTFT variance and tail latency. It isn't free — a two-way race can nearly double token spend if you let both finish — so give your preferred provider a short head start, hedge only the slow tail, and cancel losers immediately. Reserve it for user-facing, latency-critical paths.

How to measure TTFT properly

You can't improve a number you record incorrectly, and an end-to-end stopwatch hides which component is the culprit. To measure TTFT honestly:

Time the right boundaries. Start the clock when you send the request; stop it on the first streamed chunk that contains content — not when the response completes. If you're not streaming, you can't measure true TTFT at all.
Separate TTFT from total. Log request-to-first-token and request-to-last-token as distinct fields, plus output token count, so you can derive tokens-per-second and never conflate a slow start with a long answer.
Watch the distribution. Median TTFT can look healthy while p95 quietly ruins the experience for one user in twenty, so track p50/p95/p99 — tail TTFT is where queue spikes and bad nodes hide. Tag every call with model, provider, and region so a slow p99 is attributable, not a mystery.
Measure under realistic load. A single warm request from your laptop is the best case. Benchmark with concurrency and real prompt sizes, because prefill and queue time both behave very differently under pressure.

With per-call TTFT logged this way, every lever above becomes testable: did prompt caching move median TTFT? Did the faster host help p95? Did racing tighten the p99? Without the numbers, you're optimizing on vibes.

Putting it together

Improving time to first token is a stack of compounding moves, not one switch: stream so the wait collapses to TTFT, shrink and cache the prompt to cut prefill, pick a fast-enough model on a fast inference host, keep connections warm and regional, race the slow tail, and measure all of it on the right boundaries. The hard parts — opt-in caching, parallel racing with first-token-wins and disciplined cancellation, and true per-call TTFT accounting — are plumbing you'd otherwise rebuild in every service. Owning that behavior behind one stable endpoint is the job of an LLM gateway, and TTFT is one piece of the broader effort to reduce LLM latency.

flo2 is a developer-first, bring-your-own-key LLM gateway built for this: one OpenAI- and Anthropic-compatible key that routes each request to the fastest model, opt-in response caching, AI racing with a configurable head start and first-token-wins, and real per-call latency accounting — with zero markup on your own provider keys, so you can add Groq, Cerebras, and the rest and benchmark their TTFT head to head. It's free during Beta, so you can start shaving your real first-token latency today.

One key, every model — zero markup.

Bring your own provider keys. flo2 routes to the cheapest, fastest model with fallback, racing and true cost accounting — free during Beta.

Get your flo2 key →