Input vs Output Tokens: Why They're Priced Differently
Every LLM API invoice is really two separate bills disguised as one. Understanding input vs output tokens — what they are, why they're priced differently, and how cached input fits in — is the single most actionable piece of LLM cost literacy a developer can have. Get it wrong and you'll optimise the wrong side of the equation; get it right and you can cut your monthly bill in half without switching models.
What are input and output tokens?
When you make a chat completion request, you send text in and receive text back. The provider counts both sides in tokens — roughly three to four characters per token for English prose, though code and non-Latin scripts tokenise differently.
- Input tokens (also called prompt tokens) — everything you send: the system prompt, few-shot examples, conversation history, retrieved context, and the user's message. You control these entirely before the request fires.
- Output tokens (also called completion tokens) — every token the model generates in its response. You influence these through prompting and the
max_tokensparameter, but you don't control them the way you control input. - Cached input tokens — a subset of your input tokens that a provider has already seen and stored from a previous request. Most major providers offer a meaningful discount on cached tokens because they don't need to run the full attention computation over them again.
Most provider APIs return all three counts in the usage object of every response. If you're not logging all three separately per call, you're flying blind on cost.
Why output tokens cost more than input tokens
This surprises a lot of developers the first time they see it. Intuitively it feels like sending a 10,000-token context should be the expensive part, but output tokens are almost always priced two to five times higher than input tokens. The reason is how autoregressive generation works.
During the prefill phase (processing your input), the model runs a forward pass over all your input tokens in parallel. Modern hardware is excellent at this — the GPU can compute attention across thousands of tokens simultaneously. It's relatively cheap per token.
During the decode phase (generating your output), the model must produce one token at a time. Each new token depends on every token before it, so generation is inherently sequential. The GPU sits mostly idle between steps, waiting for memory bandwidth rather than crunching matmuls at full utilisation. You're paying for time-on-hardware, and sequential decode is expensive time.
This is why output tokens cost more, and why the same underlying model can produce wildly different bills depending on your token shape. A classifier that sends 800 input tokens and returns a single word has a completely different cost profile than an agent step that returns 1,500 tokens of reasoning. Understanding this asymmetry is the foundation of all prompt-level cost optimisation.
Cached input tokens: the discount most developers underuse
Prompt caching (also called context caching) lets providers skip re-computing attention over the static parts of your prompt that haven't changed between requests. The classic candidates are:
- Your system prompt (usually identical across every call for a given app)
- Few-shot examples baked into the prompt
- Large retrieved documents sent repeatedly
- Conversation history prefix that hasn't changed
When a cache hit occurs, the provider bills cached input tokens at a heavily discounted rate — typically 50–90% cheaper than regular input tokens. Anthropic's Claude charges roughly 10% of the standard input price for cache hits. OpenAI's prompt caching is applied automatically on eligible prefixes. The mechanics differ, but the principle is the same: stable prefixes are cheap; only new tokens are expensive.
The catch is that the cached prefix must be byte-for-byte identical and long enough to qualify (most providers require at least a few hundred tokens). Put your variable content — the user's message, dynamic context — at the end of the prompt, not the beginning, so the stable prefix stays unbroken. For a deeper look at how tokens and context windows interact, see our guide to LLM context windows.
Illustrative cost example
The table below uses made-up round numbers labelled as illustrative — real model prices vary and change frequently. The point is the shape of the calculation, not the exact figures.
| Token type | Count | Illustrative rate (per 1M tokens) | Illustrative cost |
|---|---|---|---|
| Input (new) | 200 | $3.00 | $0.00060 |
| Input (cached) | 4,000 | $0.30 (90% discount) | $0.00120 |
| Output | 500 | $15.00 (5× input rate) | $0.00750 |
| Total | 4,700 | — | $0.00930 |
Note that 500 output tokens account for roughly 81% of the total cost, even though they're only 11% of the total token count. This is the asymmetry you must design around. If you had disabled caching, those 4,000 cached input tokens at full price would have added another $0.012 — more than doubling the call cost.
How to estimate your token counts before you call
You don't have to fire a live API call to know roughly how many tokens your prompt will use. All major providers publish open-source tokenisers:
- OpenAI models — use the tiktoken library (
cl100k_basefor GPT-4 family,o200k_basefor the o-series). - Anthropic Claude — the Anthropic SDK exposes a
client.beta.messages.count_tokens()method that returns exact counts before generation. - Open-weight models — use HuggingFace Transformers'
AutoTokenizerfor the model's specific tokeniser.
Integrating a token count step into your development loop — logging prompt token counts during testing — surfaces expensive prompts before they hit production. For an accessible explanation of why tokenomics matters across your whole LLM stack, see AI tokenomics.
Practical optimisation techniques
Trim and tighten your input
Every input token you cut saves money at the input rate. More importantly, it often frees room in the context window for the tokens that actually matter. Audit your system prompt regularly — most accumulate instructions that were added for an edge case and never removed. Summarise long chat histories rather than appending them indefinitely. If you're doing RAG, retrieve fewer, higher-ranked chunks rather than everything above a loose threshold.
Cap max_tokens aggressively
Set max_tokens to roughly what the task actually needs, not an unlimited ceiling. A task that returns a JSON object with three fields doesn't need 2,000 tokens of room. Leaving it uncapped is an invitation for models to pad their responses — and you pay for every padding token at output prices. Where appropriate, instruct the model explicitly to be concise: "Respond in three sentences or fewer" or "Return only valid JSON, no explanation."
Ask for structured, terse output
Prose is verbose. If you only need a classification label, an extracted field, or a yes/no decision, ask for exactly that. Structured output modes (JSON mode, tool-calling with a schema) constrain the model to what you need and cut the decorative text it might otherwise generate. Fewer output tokens is cheaper output.
Maximise cache hit rate
Order your prompt so the stable content comes first and variable content comes last. Keep your system prompt byte-stable across calls — even minor whitespace changes will break the cache. If you're building a multi-turn chat app, prepend any large, static context (persona, tools, domain knowledge) before the conversation history, and keep that prefix constant.
Tracking the input/output split per call
Aggregate dashboard views hide the token shape problem. A weekly total of 50M tokens tells you nothing about whether a new feature is spending five times what it should on output. Log at the per-call level:
usage.prompt_tokens(or equivalent) — your inputusage.completion_tokens— your outputusage.prompt_tokens_details.cached_tokens(OpenAI) orusage.cache_read_input_tokens(Anthropic) — the cached subset- Compute cost per call =
(new_input × input_rate) + (cached_input × cache_rate) + (output × output_rate)
Attribute those costs to the route, feature, or user that generated them. Once you can see "this agent step costs $0.02 per invocation and fires 1,000 times a day," you have a concrete target and can measure whether your optimisations actually move the number.
This is exactly the visibility gap that flo2 is built to close. flo2 is a developer-first LLM gateway with zero token markup — you bring your own provider keys and pay providers directly, so there's no per-token surcharge on top. Every API response is logged with its full token split: new input, cached input, and output counted separately, with computed cost per call surfaced immediately. You get true per-call cost accounting across every model you route to, without wiring up the accounting layer yourself. Try it free during beta.