2026-06-03 · flo2 blog

Input vs Output Tokens: Why They're Priced Differently

Every LLM API invoice is really two separate bills disguised as one. Understanding input vs output tokens — what they are, why they're priced differently, and how cached input fits in — is the single most actionable piece of LLM cost literacy a developer can have. Get it wrong and you'll optimise the wrong side of the equation; get it right and you can cut your monthly bill in half without switching models.

What are input and output tokens?

When you make a chat completion request, you send text in and receive text back. The provider counts both sides in tokens — roughly three to four characters per token for English prose, though code and non-Latin scripts tokenise differently.

Input tokens (also called prompt tokens) — everything you send: the system prompt, few-shot examples, conversation history, retrieved context, and the user's message. You control these entirely before the request fires.
Output tokens (also called completion tokens) — every token the model generates in its response. You influence these through prompting and the max_tokens parameter, but you don't control them the way you control input.
Cached input tokens — a subset of your input tokens that a provider has already seen and stored from a previous request. Most major providers offer a meaningful discount on cached tokens because they don't need to run the full attention computation over them again.

Most provider APIs return all three counts in the usage object of every response. If you're not logging all three separately per call, you're flying blind on cost.

Why output tokens cost more than input tokens

This surprises a lot of developers the first time they see it. Intuitively it feels like sending a 10,000-token context should be the expensive part, but output tokens are almost always priced two to five times higher than input tokens. The reason is how autoregressive generation works.

During the prefill phase (processing your input), the model runs a forward pass over all your input tokens in parallel. Modern hardware is excellent at this — the GPU can compute attention across thousands of tokens simultaneously. It's relatively cheap per token.

During the decode phase (generating your output), the model must produce one token at a time. Each new token depends on every token before it, so generation is inherently sequential. The GPU sits mostly idle between steps, waiting for memory bandwidth rather than crunching matmuls at full utilisation. You're paying for time-on-hardware, and sequential decode is expensive time.

This is why output tokens cost more, and why the same underlying model can produce wildly different bills depending on your token shape. A classifier that sends 800 input tokens and returns a single word has a completely different cost profile than an agent step that returns 1,500 tokens of reasoning. Understanding this asymmetry is the foundation of all prompt-level cost optimisation.

Cached input tokens: the discount most developers underuse

Prompt caching (also called context caching) lets providers skip re-computing attention over the static parts of your prompt that haven't changed between requests. The classic candidates are:

Your system prompt (usually identical across every call for a given app)
Few-shot examples baked into the prompt
Large retrieved documents sent repeatedly
Conversation history prefix that hasn't changed

When a cache hit occurs, the provider bills cached input tokens at a heavily discounted rate — typically 50–90% cheaper than regular input tokens. Anthropic's Claude charges roughly 10% of the standard input price for cache hits. OpenAI's prompt caching is applied automatically on eligible prefixes. The mechanics differ, but the principle is the same: stable prefixes are cheap; only new tokens are expensive.

The catch is that the cached prefix must be byte-for-byte identical and long enough to qualify (most providers require at least a few hundred tokens). Put your variable content — the user's message, dynamic context — at the end of the prompt, not the beginning, so the stable prefix stays unbroken. For a deeper look at how tokens and context windows interact, see our guide to LLM context windows.

Illustrative cost example

The table below uses made-up round numbers labelled as illustrative — real model prices vary and change frequently. The point is the shape of the calculation, not the exact figures.

Token type	Count	Illustrative rate (per 1M tokens)	Illustrative cost
Input (new)	200	$3.00	$0.00060
Input (cached)	4,000	$0.30 (90% discount)	$0.00120
Output	500	$15.00 (5× input rate)	$0.00750
Total	4,700	—	$0.00930

Note that 500 output tokens account for roughly 81% of the total cost, even though they're only 11% of the total token count. This is the asymmetry you must design around. If you had disabled caching, those 4,000 cached input tokens at full price would have added another $0.012 — more than doubling the call cost.

How to estimate your token counts before you call

You don't have to fire a live API call to know roughly how many tokens your prompt will use. All major providers publish open-source tokenisers:

OpenAI models — use the tiktoken library (cl100k_base for GPT-4 family, o200k_base for the o-series).
Anthropic Claude — the Anthropic SDK exposes a client.beta.messages.count_tokens() method that returns exact counts before generation.
Open-weight models — use HuggingFace Transformers' AutoTokenizer for the model's specific tokeniser.

Integrating a token count step into your development loop — logging prompt token counts during testing — surfaces expensive prompts before they hit production. For an accessible explanation of why tokenomics matters across your whole LLM stack, see AI tokenomics.

Practical optimisation techniques

Trim and tighten your input

Every input token you cut saves money at the input rate. More importantly, it often frees room in the context window for the tokens that actually matter. Audit your system prompt regularly — most accumulate instructions that were added for an edge case and never removed. Summarise long chat histories rather than appending them indefinitely. If you're doing RAG, retrieve fewer, higher-ranked chunks rather than everything above a loose threshold.

Cap max_tokens aggressively

Set max_tokens to roughly what the task actually needs, not an unlimited ceiling. A task that returns a JSON object with three fields doesn't need 2,000 tokens of room. Leaving it uncapped is an invitation for models to pad their responses — and you pay for every padding token at output prices. Where appropriate, instruct the model explicitly to be concise: "Respond in three sentences or fewer" or "Return only valid JSON, no explanation."

Ask for structured, terse output

Prose is verbose. If you only need a classification label, an extracted field, or a yes/no decision, ask for exactly that. Structured output modes (JSON mode, tool-calling with a schema) constrain the model to what you need and cut the decorative text it might otherwise generate. Fewer output tokens is cheaper output.

Maximise cache hit rate

Order your prompt so the stable content comes first and variable content comes last. Keep your system prompt byte-stable across calls — even minor whitespace changes will break the cache. If you're building a multi-turn chat app, prepend any large, static context (persona, tools, domain knowledge) before the conversation history, and keep that prefix constant.

Tracking the input/output split per call

Aggregate dashboard views hide the token shape problem. A weekly total of 50M tokens tells you nothing about whether a new feature is spending five times what it should on output. Log at the per-call level:

usage.prompt_tokens (or equivalent) — your input
usage.completion_tokens — your output
usage.prompt_tokens_details.cached_tokens (OpenAI) or usage.cache_read_input_tokens (Anthropic) — the cached subset
Compute cost per call = (new_input × input_rate) + (cached_input × cache_rate) + (output × output_rate)

Attribute those costs to the route, feature, or user that generated them. Once you can see "this agent step costs $0.02 per invocation and fires 1,000 times a day," you have a concrete target and can measure whether your optimisations actually move the number.

This is exactly the visibility gap that flo2 is built to close. flo2 is a developer-first LLM gateway with zero token markup — you bring your own provider keys and pay providers directly, so there's no per-token surcharge on top. Every API response is logged with its full token split: new input, cached input, and output counted separately, with computed cost per call surfaced immediately. You get true per-call cost accounting across every model you route to, without wiring up the accounting layer yourself. Try it free during beta.

One key, every model — zero markup.

Bring your own provider keys. flo2 routes to the cheapest, fastest model with fallback, racing and true cost accounting — free during Beta.

Get your flo2 key →