Tokens Per Second (tok/s): What It Means for LLM Speed
Every LLM API response has two phases: the wait for the first character, and the stream that follows. Tokens per second — often written as tok/s or TPS — governs that second phase, measuring how fast a model generates output once it has started. Understanding tok/s is essential for any developer optimizing latency budgets, cost models, or user experience on top of LLMs, because it sets a hard floor on how long any non-trivial response can take.
What "tokens per second" actually measures
A token is the atomic unit of text that LLMs work in — roughly four characters in English, though the exact size depends on the model's tokenizer. When a model generates a 400-token answer, it produces those tokens one at a time in a sequential loop called decoding. Tokens per second is the rate of that loop: how many tokens the model emits per wall-clock second during generation.
The typical shorthand "tok/s" almost always refers specifically to decode throughput — the output generation rate — not to any other phase of inference. A model running at 80 tok/s will finish a 400-token answer in five seconds from the moment it starts generating. At 200 tok/s, the same answer takes two seconds. The arithmetic is direct.
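The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a tokenizer: `estimate_tokens` and `decode_seconds` are hypothetical helper names, and the four-characters-per-token figure is the rough English rule of thumb mentioned earlier, so real counts will differ by model.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (assumed ~4 chars per token)."""
    return max(1, round(len(text) / chars_per_token))


def decode_seconds(output_tokens: int, tok_per_s: float) -> float:
    """Time spent generating, counted from the moment decoding starts."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return output_tokens / tok_per_s


print(decode_seconds(400, 80))   # 5.0
print(decode_seconds(400, 200))  # 2.0
```

For anything beyond a back-of-envelope estimate, the model's own tokenizer or the API's reported usage is the better source for token counts.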
Prefill vs. decode: two distinct phases
LLM inference has two sequential stages, and confusing them is a common source of measurement error.
- Prefill is when the model reads and encodes your entire prompt — system message, conversation history, retrieved documents, tool schemas. The model processes all input tokens in parallel, which makes prefill fast per-token but still significant for large contexts. Prefill time is the primary driver of time to first token (TTFT). A 1,000-token prompt prefills much faster than a 30,000-token RAG context; the latter adds real latency before any output appears.
- Decode is the generation loop. After prefill completes, the model emits one token, feeds it back into context, and repeats — sequentially, one step at a time. This is fundamentally slower per token than prefill because each step depends on the previous one. Decode throughput, measured in tok/s, is what benchmarks and providers advertise when they quote output speed.
The distinction matters in practice. A provider can have excellent TTFT (fast prefill, low queue latency) but slow decode, making long responses feel sluggish even when short ones feel snappy. Optimizing for each phase often requires different techniques.
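To make the two phases concrete, here is a toy sketch of the control flow. It assumes a hypothetical `forward` callable that takes the full context and returns the next token id; real inference engines cache attention state between steps rather than re-reading the whole context, so this shows only the sequential dependency, not actual performance behavior.

```python
from typing import Callable, List


def generate(prompt: List[int],
             forward: Callable[[List[int]], int],
             max_new_tokens: int,
             eos: int = 0) -> List[int]:
    # Prefill: one pass over the whole prompt yields the first output token.
    context = list(prompt)
    token = forward(context)
    output = [token]
    # Decode: every step needs the token produced by the previous step,
    # so the loop cannot be parallelized across output positions.
    while token != eos and len(output) < max_new_tokens:
        context.append(token)
        token = forward(context)
        output.append(token)
    return output


def toy_forward(context: List[int]) -> int:
    """Stand-in for a model: the next token is the context length mod 5."""
    return len(context) % 5


print(generate([7, 7, 7], toy_forward, max_new_tokens=10))  # [3, 4, 0]
```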
How tok/s combines with TTFT to give total latency
Total response time is not a single number — it's a sum:
total_latency ≈ TTFT + generation
TTFT = network_round_trip + provider_queue + prefill
generation = output_tokens ÷ tokens_per_second
A concrete example: with a 300 ms TTFT and 60 tok/s, a 240-token answer takes 4.3 seconds total (0.3 s + 240 ÷ 60). Double the tok/s to 120 and it drops to 2.3 seconds. Halve the output length to 120 tokens at the original speed and you also land at 2.3 seconds; do both and you reach 1.3 seconds. TTFT is an additive term, while output length and tok/s set the generation term as a ratio, so improvements to each lever stack.
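The latency formula is easy to check numerically. The sketch below uses a hypothetical `total_latency_s` helper and works through the 300 ms TTFT scenario for each lever in turn.

```python
def total_latency_s(ttft_s: float, output_tokens: int, tok_per_s: float) -> float:
    """total_latency = TTFT + output_tokens / tokens_per_second"""
    return ttft_s + output_tokens / tok_per_s


baseline = total_latency_s(0.3, 240, 60)       # 300 ms TTFT, 60 tok/s, 240 tokens
faster_decode = total_latency_s(0.3, 240, 120)  # double the tok/s
shorter_output = total_latency_s(0.3, 120, 60)  # halve the output length
both = total_latency_s(0.3, 120, 120)           # apply both changes

for label, value in [("baseline", baseline),
                     ("2x tok/s", faster_decode),
                     ("half output", shorter_output),
                     ("both", both)]:
    print(f"{label:12s} {value:.1f} s")
```

Doubling decode speed and halving output length each remove the same two seconds here, because both act on the generation term; only changes to TTFT touch the fixed 0.3 s.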
For streaming UIs, perceived latency is dominated by TTFT; tok/s governs how fluidly text flows after that. For batch workloads where you need the full response before acting, total latency matters most, and tok/s becomes the dominant term for long outputs.
What determines a model's tokens per second
Several factors govern decode throughput, and most of them are out of the application developer's hands — but knowing them helps you choose providers wisely.
Model size and architecture
Larger models are generally slower. A 70B-parameter model moves far more data through memory per decode step than a 7B model, so it delivers lower tok/s on equivalent hardware. Architecture choices (attention mechanism, MoE routing, layer count) also affect speed independently of parameter count. Smaller "mini" or "flash" class models routinely outrun their larger siblings by wide margins.
Hardware and inference stack
The same model weights can run at dramatically different token rates depending on the silicon and software serving them. Memory bandwidth is typically the primary decode bottleneck, because every decode step has to stream the model's weights from memory, and some specialized AI accelerators offer far more of it than general-purpose GPU clusters. The serving software (continuous batching, flash attention, optimized kernels) adds further leverage. A purpose-built inference host can benchmark substantially faster than a general-purpose cloud GPU running the same weights.
Quantization
Quantization reduces weight precision (e.g., 16-bit floating point to 8-bit integers), so fewer bytes have to move through memory for each decode step. This often increases tok/s meaningfully with limited quality impact. Many providers apply some form of it by default; some offer tiers to let you trade quality against speed.
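A common back-of-envelope way to see how model size, memory bandwidth, and quantization interact is to assume that each decode step must read every weight once, so single-stream tok/s cannot exceed bandwidth divided by weight size. The sketch below applies that rule of thumb. The 2,000 GB/s bandwidth figure is a made-up illustrative value, and the estimate ignores attention-cache traffic, batching, multi-device parallelism, and MoE models that only read a subset of weights per token.

```python
def decode_ceiling_tok_s(params_billion: float,
                         bytes_per_param: float,
                         mem_bandwidth_gb_s: float) -> float:
    """Rough upper bound on single-stream decode speed (bandwidth-bound assumption)."""
    weight_gb = params_billion * bytes_per_param  # GB read per decode step
    return mem_bandwidth_gb_s / weight_gb


BANDWIDTH_GB_S = 2000.0  # hypothetical accelerator, for illustration only

print(round(decode_ceiling_tok_s(70, 2, BANDWIDTH_GB_S), 1))  # 70B at 16-bit: 14.3
print(round(decode_ceiling_tok_s(7, 2, BANDWIDTH_GB_S), 1))   # 7B at 16-bit: 142.9
print(round(decode_ceiling_tok_s(70, 1, BANDWIDTH_GB_S), 1))  # 70B at 8-bit: 28.6
```

Under this simplification, a tenfold smaller model has a tenfold higher ceiling, and halving bytes per weight doubles it. Measured speeds will differ; the point is only the direction and rough size of each effect.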
Request batching and concurrent load
Inference servers batch multiple requests together to amortize each decoding step. At low load you may achieve near-peak tok/s; under heavy concurrency, batching increases total server throughput but can reduce per-request tok/s. This is why the same model at the same provider shows wide variance by time of day.
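The trade-off between server throughput and per-request speed can be illustrated with a deliberately simple toy model. It assumes each decode step costs a fixed amount of time plus a small increment per request in the batch; both constants below are invented for illustration and do not describe any real server.

```python
def batch_rates(batch_size: int,
                fixed_ms: float = 10.0,
                per_request_ms: float = 0.5) -> tuple:
    """Return (per_request_tok_s, aggregate_tok_s) under a toy linear step-cost model."""
    step_ms = fixed_ms + per_request_ms * batch_size  # one decode step for the whole batch
    per_request = 1000.0 / step_ms                    # each request gets one token per step
    return per_request, per_request * batch_size


for size in (1, 8, 32, 128):
    per_request, aggregate = batch_rates(size)
    print(f"batch={size:3d}  per-request={per_request:6.1f} tok/s  aggregate={aggregate:7.1f} tok/s")
```

In this model, aggregate throughput keeps rising with batch size while each individual request slows down, which is the pattern described above.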
Why tok/s matters for UX and cost
In any streaming chat interface, tok/s controls the AI's "typing speed": how fast words appear after the first token arrives. Humans read English at roughly 200–250 words per minute, which is only about 4 to 6 tokens per second at a rule-of-thumb 1.3 tokens per word, so a model generating text faster than that is effectively invisible in terms of decode lag. Models well below that threshold produce a visibly slow drip. For non-streaming workloads (tool calls, structured outputs, batch pipelines) higher tok/s reduces wall-clock latency, which directly affects downstream throughput.
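To compare reading speed with decode speed, both need to be in the same unit. The conversion below assumes about 1.3 tokens per English word, a common rule of thumb that varies by tokenizer and text; `reading_tok_s` is a hypothetical helper name.

```python
def reading_tok_s(words_per_minute: float, tokens_per_word: float = 1.3) -> float:
    """Convert a reading speed in words per minute to tokens per second."""
    return words_per_minute / 60.0 * tokens_per_word


print(f"{reading_tok_s(200):.1f} tok/s")  # 4.3 tok/s
print(f"{reading_tok_s(250):.1f} tok/s")  # 5.4 tok/s
```

By this estimate, a stream only falls behind a typical reader when it drops to single-digit tok/s.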
LLM APIs charge per token, not per second. But for time-bounded workloads with hard latency SLAs, tok/s sets how many output tokens you can generate before the deadline. A faster provider for the same model enlarges that token budget without requiring a higher per-token price. When you reduce LLM latency through provider choice, you often gain concurrency headroom at the same time.
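One way to reason about a latency SLA is to turn it into an output-token budget: whatever time remains after TTFT, multiplied by the decode rate. The helper below is a sketch with hypothetical names, and the SLA and TTFT values are illustrative.

```python
def output_token_budget(sla_ms: int, ttft_ms: int, tok_per_s: float) -> int:
    """Largest output length that still fits within the latency SLA."""
    remaining_ms = sla_ms - ttft_ms
    if remaining_ms <= 0:
        return 0  # TTFT alone already exceeds the SLA
    return int(remaining_ms * tok_per_s / 1000)


print(output_token_budget(3000, 300, 60))   # 162
print(output_token_budget(3000, 300, 120))  # 324
```

With the SLA and TTFT held constant, doubling tok/s doubles the number of tokens a response can contain.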
How to measure tok/s accurately
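The most accurate method: stream the response, count output tokens from the API's usage metadata, and divide by the time from first token to last. That isolates decode cleanly, excluding TTFT. Strictly, N tokens span N − 1 intervals, a difference that only matters for short outputs. A few practices keep the numbers trustworthy: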
- Sample across conditions. Tok/s varies with concurrent load and time of day. One measurement is noise — collect enough samples to build a distribution.
- Measure from your deployment region. Provider hardware varies by region, and your network context affects results. Benchmark from production-equivalent conditions.
- Benchmark multiple providers. The fastest provider for one model may not be the fastest for another, and the gap shifts as infrastructure evolves.
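The measurement approach described above might look like the following sketch. It is client-agnostic by design: `chunks` is any iterable that yields as streamed pieces arrive, and `output_tokens` is whatever count your API reports in its usage metadata. Both function names are hypothetical, and no particular SDK is assumed. Because a chunk may carry more than one token, the result is an approximation.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, List, Optional


def measure_decode_tok_s(chunks: Iterable[object],
                         output_tokens: int,
                         clock: Callable[[], float] = time.perf_counter) -> Optional[float]:
    """Consume a stream and return decode tok/s, excluding time to first token."""
    first = last = None
    for _ in chunks:
        now = clock()
        if first is None:
            first = now  # first chunk marks the end of TTFT
        last = now
    if first is None or last == first or output_tokens < 2:
        return None  # not enough data to compute a rate
    # N tokens span N - 1 inter-token intervals.
    return (output_tokens - 1) / (last - first)


def summarize(samples: List[float]) -> Dict[str, float]:
    """Summarize repeated measurements as a distribution rather than one number."""
    cuts = statistics.quantiles(samples, n=20)  # cut points at 5%, 10%, ..., 95%
    return {"p5": cuts[0], "p50": statistics.median(samples), "p95": cuts[-1]}
```

Running `measure_decode_tok_s` on many requests at different times of day and passing the results to `summarize` gives a low, median, and high figure per provider, which is more useful for comparison than any single run.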
Specialized providers can be far faster — and you can race them
Provider choice is one of the highest-leverage dials for tok/s. Specialized inference providers purpose-built for LLM throughput can benchmark at rates that make general-purpose endpoints look slow for the same model. In published benchmarks, purpose-built stacks frequently outperform general-purpose GPU clouds by factors that matter for interactive use cases.
The challenge: the fastest provider shifts over time and varies by model, region, and load. The robust solution is to race providers: send each request to multiple endpoints simultaneously and stream from whichever responds first. This keeps each call on whichever provider is fastest at that moment, without manual benchmarking or complex failover logic.
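As a rough illustration of the racing idea, the sketch below starts several streams concurrently with `asyncio`, keeps the one that produces a first chunk soonest, and cancels the rest. Everything here is hypothetical: `fake_provider` stands in for real streaming API calls, the names are invented, and error handling and retries are omitted. It is not a description of how any particular gateway implements racing.

```python
import asyncio
from typing import AsyncIterator, Callable, Dict, List, Tuple


async def fake_provider(ttft_s: float, text: str) -> AsyncIterator[str]:
    """Stand-in for a streaming API call; the sleep simulates time to first token."""
    await asyncio.sleep(ttft_s)
    for word in text.split():
        yield word


async def _first_chunk(name: str, stream: AsyncIterator[str]):
    chunk = await stream.__anext__()  # wait for this provider's first chunk
    return name, chunk, stream


async def race(providers: Dict[str, Callable[[], AsyncIterator[str]]]) -> Tuple[str, List[str]]:
    tasks = [asyncio.create_task(_first_chunk(name, make()))
             for name, make in providers.items()]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # stop waiting on the slower providers
    await asyncio.gather(*pending, return_exceptions=True)
    winner, first, stream = next(iter(done)).result()
    chunks = [first]
    async for chunk in stream:  # keep streaming from the winner only
        chunks.append(chunk)
    return winner, chunks


providers = {
    "slow": lambda: fake_provider(0.2, "a b c"),
    "fast": lambda: fake_provider(0.01, "x y z"),
}
print(asyncio.run(race(providers)))  # ('fast', ['x', 'y', 'z'])
```

Note that this sketch picks the winner by first chunk, so it selects for low TTFT; a provider that starts first is not guaranteed to have the highest decode rate for the rest of the stream.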
flo2 is a developer-first LLM gateway built for this workflow. Bring your own provider keys, pay zero token markup, and enable AI racing to take the fastest provider on every call. Throughput logging gives you real tok/s numbers in production — so you catch regressions before users do. Free during beta.