LLM API Pricing Compared (2026): How to Read & Compare Costs
If you're evaluating LLM APIs for a production app, LLM API pricing looks deceptively simple at first glance: a few dollars per million tokens. Look closer and you'll find at least six variables that interact to produce your actual invoice: input price, output price, cached-input discounts, context limits, batch discounts, and surcharges for images or audio. Missing even one can make a cheap model unexpectedly expensive for your workload. This guide explains how the pricing mechanics actually work, where headline rates mislead, and how to build an apples-to-apples comparison before you commit.
How LLM API Pricing Actually Works
Every LLM API charges by the token. A token is roughly four characters of English text, so by that rule of thumb "large language model" (20 characters) is about five tokens; actual counts depend on each model's tokenizer. Providers quote rates per million tokens, usually written as $/M, but that single headline splits into at least three separate line items on your bill.
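As a rough sketch of that arithmetic, the snippet below applies the four-characters-per-token rule of thumb and converts a token count into dollars at a $/M rate. The function names and the $2/M rate are illustrative assumptions, and real tokenizers will give different counts, so treat this as an estimate only.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (English rule of thumb)."""
    return math.ceil(len(text) / chars_per_token)


def cost_usd(tokens: int, rate_per_million: float) -> float:
    """Convert a token count and a $/M rate into dollars."""
    return tokens * rate_per_million / 1_000_000


phrase_tokens = estimate_tokens("large language model")  # 20 characters -> 5
million_token_cost = cost_usd(1_000_000, 2.0)  # hypothetical $2/M rate -> 2.0
print(phrase_tokens, million_token_cost)
```

For billing-grade numbers, the token counts reported by the provider are a safer source than any character-based estimate.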
Input vs. output tokens
Your prompt (system message, retrieved context, user turn, few-shot examples) is billed as input tokens. The model's reply is billed as output tokens. Output almost always costs more than input — commonly two to five times more — because generation is the expensive, sequential part of inference. That gap matters enormously: a long-form drafting or code generation app that produces 1,500 output tokens per call has a fundamentally different cost profile than a classifier that returns one word.
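To make the difference in cost profile concrete, here is a minimal sketch that computes what share of a call's cost comes from output tokens. The rates ($1/M in, $4/M out) and the 500-token prompt are hypothetical values chosen only for illustration.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_rate: float, output_rate: float) -> float:
    """Dollar cost of one call; rates are in $ per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def output_share(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Fraction of the call's cost that comes from output tokens."""
    output_part = output_tokens * output_rate
    return output_part / (input_tokens * input_rate + output_part)


IN_RATE, OUT_RATE = 1.00, 4.00  # hypothetical $/M rates

# Drafting app: 500-token prompt, 1,500-token reply.
drafting = output_share(500, 1500, IN_RATE, OUT_RATE)
# Classifier: same prompt, one-token reply.
classifier = output_share(500, 1, IN_RATE, OUT_RATE)

print(f"drafting: {drafting:.0%} of cost is output")
print(f"classifier: {classifier:.1%} of cost is output")
```

Under these assumed numbers, output accounts for more than 90% of the drafting call's cost and under 1% of the classifier's, so the two workloads are sensitive to different columns of the rate card.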
Cached-input discounts
Most major providers now offer a discounted rate for the static portion of a prompt when it repeats across calls — sometimes called prompt caching or context caching. The discount is often 50–90% on those cached tokens. If your system prompt, RAG retrieval, or few-shot examples stay the same across many requests, designing around cache hits can cut your input spend dramatically. Not every provider offers caching, and the rules for what qualifies differ, so it's a dimension worth comparing explicitly.
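The effect of a cache discount on input spend can be sketched as follows. The rates, the 80% hit rate, and the function name are assumptions for illustration; each provider's eligibility rules determine what hit rate you would actually see.

```python
def input_cost_with_cache(input_tokens: float, cache_hit_rate: float,
                          input_rate: float, cached_rate: float) -> float:
    """Input cost in dollars when a fraction of tokens is billed at the cached rate.

    Rates are in $ per million tokens; cache_hit_rate is between 0 and 1.
    """
    cached = input_tokens * cache_hit_rate
    fresh = input_tokens - cached
    return (fresh * input_rate + cached * cached_rate) / 1_000_000


# Hypothetical: 10,000 input tokens per call, $1/M standard rate,
# $0.10/M cached rate (a 90% discount), 80% of tokens served from cache.
no_cache = input_cost_with_cache(10_000, 0.0, 1.00, 0.10)    # about $0.0100
with_cache = input_cost_with_cache(10_000, 0.8, 1.00, 0.10)  # about $0.0028

print(f"without cache: ${no_cache:.4f}, with cache: ${with_cache:.4f}")
```

Note that the saving depends on both the discount and the hit rate: a 90% discount on 80% of tokens reduces input spend by 72% here, not 90%.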
Context limits
A larger context window is not free to use. When you send a 32k-token context, you pay for 32k input tokens on every single call. If a chat history of that size is resent in full on each of 50 turns in one conversation, that is 50 × 32k = 1.6M input tokens for one user. Context length is a multiplier on your input cost, and it tends to surprise teams that design locally with short prompts but go to production with real-world conversation lengths.
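The multiplication is easy to check, and it is also worth modeling the more common case where the history grows each turn. The sketch below does both; the function names and the 1,000-token system prompt and 600 tokens added per turn are hypothetical.

```python
def resent_input_tokens(tokens_per_turn: int, turns: int) -> int:
    """Total input tokens when a same-size context is sent on every turn."""
    return tokens_per_turn * turns


def growing_history_input_tokens(system_tokens: int,
                                 tokens_added_per_turn: int,
                                 turns: int) -> int:
    """Total input tokens when each turn resends the system prompt plus
    everything accumulated so far (history grows linearly per turn)."""
    return sum(system_tokens + tokens_added_per_turn * turn
               for turn in range(1, turns + 1))


# The example from the text: 32k tokens resent on each of 50 turns.
print(resent_input_tokens(32_000, 50))  # 1600000

# Hypothetical growing conversation: 1,000-token system prompt,
# 600 tokens of new history per turn, 50 turns.
print(growing_history_input_tokens(1_000, 600, 50))  # 815000
```

Because the growing case sums over turns, total input tokens rise roughly with the square of conversation length, which is why trimming or summarizing history tends to pay off.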
Batch discounts
Several providers offer an asynchronous batch API that accepts large jobs and returns results within a defined window (often 24 hours). You trade latency for price: batch rates are often about 50% below synchronous pricing. If you have offline workloads like evaluation runs, document processing pipelines, or nightly report generation, batch is worth a separate comparison.
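A quick way to size the opportunity is to estimate what share of your spend could tolerate batch latency. This is a minimal sketch; the 50% default discount mirrors the typical figure mentioned above, and the $1,000 spend and 40% batch share are made-up inputs.

```python
def blended_cost(sync_cost: float, batch_share: float,
                 batch_discount: float = 0.5) -> float:
    """Total cost after moving a share of a workload to a discounted batch API.

    sync_cost is what the whole workload costs at synchronous rates;
    batch_share and batch_discount are fractions between 0 and 1.
    """
    stays_sync = sync_cost * (1 - batch_share)
    moved_to_batch = sync_cost * batch_share * (1 - batch_discount)
    return stays_sync + moved_to_batch


# Hypothetical: $1,000/month at synchronous rates, 40% of it batch-eligible.
print(f"${blended_cost(1000.0, 0.4):.2f}")  # about $800.00
```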
Image and audio surcharges
Multimodal inputs are usually priced separately and can be significantly more expensive than text tokens. A high-resolution image submitted to a vision model may cost the equivalent of several hundred text tokens depending on how the provider tiles and processes it. If your app handles documents, screenshots, or voice, add this dimension to your comparison — it can dominate the bill for media-heavy workloads.
Why Headline Prices Mislead
Providers publish the number that looks best in a comparison table. That number is almost always the input rate on a small-context, synchronous call — optimized to look competitive. Several factors make the real cost diverge from the headline.
Output-heavy workloads
A model with a $0.50/M input rate and a $15/M output rate looks cheap if you only read the input column. If your tasks generate 500 output tokens for every 200 input tokens, the output side dominates: that "cheap" model costs about $0.0076 per task, versus about $0.0027 for a model priced at $1/M in and $5/M out. You cannot rank models by input rate alone and expect the ranking to hold for your workload.
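The comparison in this paragraph can be reproduced in a few lines. The sketch also derives the output-to-input ratio at which the two price points break even; the function names are illustrative, and the rates are the ones quoted in the text.

```python
def task_cost(input_tokens: int, output_tokens: int,
              input_rate: float, output_rate: float) -> float:
    """Dollar cost of one task; rates are in $ per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def break_even_output_ratio(in_a: float, out_a: float,
                            in_b: float, out_b: float) -> float:
    """Output/input token ratio at which models A and B cost the same.

    Assumes A is cheaper on input (in_a < in_b) and pricier on output
    (out_a > out_b). Above this ratio, A is the more expensive model.
    """
    return (in_b - in_a) / (out_a - out_b)


model_a = task_cost(200, 500, 0.50, 15.00)  # low input rate, high output rate
model_b = task_cost(200, 500, 1.00, 5.00)   # higher input rate, lower output rate

print(f"A: ${model_a:.4f}  B: ${model_b:.4f}")
print(f"break-even output/input ratio: {break_even_output_ratio(0.50, 15.00, 1.00, 5.00):.2f}")
```

With these rates the break-even ratio works out to 0.05, so model A is only cheaper when a task emits fewer than about 5 output tokens per 100 input tokens.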
Retries and failures
A cheaper model with lower reliability forces retries — sometimes to a more expensive fallback. The true cost of a task is not one call; it's the average cost across all attempts including the failed ones. A model that costs half as much per token but fails 20% of the time and triggers a fallback to a frontier model can easily cost more per successful task. See the AI tokenomics primer for how to calculate cost per successful task correctly.
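Here is a minimal sketch of that calculation under two simplifying assumptions: failures are independent, and the fallback model always succeeds. The per-task dollar figures are hypothetical and chosen so the cheap model costs half as much as a mid-priced alternative.

```python
def cost_with_fallback(primary_cost: float, failure_rate: float,
                       fallback_cost: float) -> float:
    """Expected cost per successful task when failures go to a fallback model.

    Every task pays for the primary attempt; the failing fraction also
    pays for one fallback call (assumed to always succeed).
    """
    return primary_cost + failure_rate * fallback_cost


def cost_with_retries(cost_per_attempt: float, failure_rate: float) -> float:
    """Expected cost per successful task when retrying the same model until
    it succeeds (independent attempts, 1 / (1 - p) attempts on average)."""
    return cost_per_attempt / (1 - failure_rate)


# Hypothetical per-task costs: cheap model $0.005, mid-priced model $0.010,
# frontier fallback $0.030. The cheap model fails 20% of the time.
cheap_with_fallback = cost_with_fallback(0.005, 0.20, 0.030)  # about $0.011
cheap_with_retries = cost_with_retries(0.005, 0.20)           # about $0.00625

print(f"cheap + frontier fallback: ${cheap_with_fallback:.4f}")
print(f"cheap + same-model retries: ${cheap_with_retries:.5f}")
```

In this example the cheap model with a frontier fallback ends up costing more per successful task than simply using the $0.010 model, while same-model retries keep it cheaper. Which case applies depends on your failure handling.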
Long context amplification
A model priced at $1/M input looks fine until you're passing 50k tokens per call. At that scale each call costs $0.05 in input alone, and a competitor priced at $0.80/M input saves $0.01 per call, a gap that compounds with volume. Long-context workloads amplify small per-token differences into large invoice differences.
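To see the amplification, compare the same $0.20/M price gap at a short and a long context size. The one million calls per month is an assumed volume, and the function names are illustrative.

```python
def monthly_input_cost(tokens_per_call: int, calls_per_month: int,
                       rate_per_million: float) -> float:
    """Monthly input spend in dollars at a $/M rate."""
    return tokens_per_call * calls_per_month * rate_per_million / 1_000_000


def monthly_gap(tokens_per_call: int, calls_per_month: int,
                rate_a: float = 1.00, rate_b: float = 0.80) -> float:
    """Monthly input-cost difference between two $/M rates."""
    return (monthly_input_cost(tokens_per_call, calls_per_month, rate_a)
            - monthly_input_cost(tokens_per_call, calls_per_month, rate_b))


CALLS_PER_MONTH = 1_000_000  # assumed volume

print(f"500-token prompts: ${monthly_gap(500, CALLS_PER_MONTH):,.0f}/month")
print(f"50k-token prompts: ${monthly_gap(50_000, CALLS_PER_MONTH):,.0f}/month")
```

Under these assumptions the same rate difference is worth about $100 a month at 500 tokens per call and about $10,000 a month at 50k tokens per call.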
Markup on aggregators
Some LLM API aggregators and gateways resell tokens with a markup on top of the provider's list price, anywhere from a small percentage to 20% or more. When you compare a reseller's rate against a provider's published rate, you may not be comparing the same thing. Bring-your-own-key (BYOK) services route your calls using your own provider credentials, so you pay the provider's list price with zero token markup; list price is the only honest baseline for comparison. For more on this pattern, see the cheapest LLM API guide.
How to Compare Apples-to-Apples
A rigorous LLM pricing comparison requires knowing your own numbers first. Grab a representative sample of real calls — or realistic estimates — and compute these four values per task type:
- Average input tokens per call — include the full system prompt, context, and user message.
- Average output tokens per call — measure from production logs or a realistic benchmark, not the model's maximum.
- Cache hit rate — what fraction of your input tokens repeat across calls and would qualify for a cached-input discount?
- Calls per task — how many API calls does one end-user task require, including retries and multi-step chains?
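As one way to compute those four values, the sketch below profiles a small list of call records. The record fields (`task_id`, `input_tokens`, `cached_tokens`, `output_tokens`) are hypothetical; map them to whatever your own logs contain.

```python
def profile_calls(calls: list[dict]) -> dict:
    """Summarize call logs into the four inputs needed for cost comparison."""
    total_input = sum(c["input_tokens"] for c in calls)
    total_cached = sum(c["cached_tokens"] for c in calls)
    total_output = sum(c["output_tokens"] for c in calls)
    task_count = len({c["task_id"] for c in calls})
    return {
        "avg_input_tokens": total_input / len(calls),
        "avg_output_tokens": total_output / len(calls),
        # Fraction of all input tokens that were served from cache.
        "cache_hit_rate": total_cached / total_input,
        # Includes retries and multi-step chains, since every call is logged.
        "calls_per_task": len(calls) / task_count,
    }


# Made-up sample: task t1 needed a retry, so it appears twice.
sample_calls = [
    {"task_id": "t1", "input_tokens": 1000, "cached_tokens": 0, "output_tokens": 200},
    {"task_id": "t1", "input_tokens": 1000, "cached_tokens": 800, "output_tokens": 100},
    {"task_id": "t2", "input_tokens": 1200, "cached_tokens": 800, "output_tokens": 300},
    {"task_id": "t3", "input_tokens": 800, "cached_tokens": 400, "output_tokens": 200},
]

print(profile_calls(sample_calls))
```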
With those numbers, the formula for effective cost per task on a given model is straightforward:
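cost per task = calls per task × [(non-cached input tokens × input rate) + (cached input tokens × cached rate) + (output tokens × output rate)]
Here the token counts are per-call averages and each rate is per token, that is, the published $/M rate divided by 1,000,000.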
Run that formula across the models you're evaluating. The ranking will often differ from what you'd get by sorting on input rate alone.
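A minimal sketch of the formula in code, applied to two made-up rate cards. The class names, model names, rates, and workload numbers are all hypothetical; substitute current rates from the provider's own pricing page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateCard:
    name: str
    input_rate: float   # $ per million tokens
    cached_rate: float  # $ per million cached input tokens
    output_rate: float  # $ per million tokens


@dataclass(frozen=True)
class TaskShape:
    input_tokens: float    # average per call
    output_tokens: float   # average per call
    cache_hit_rate: float  # fraction of input tokens billed at cached rate
    calls_per_task: float  # includes retries and multi-step chains


def cost_per_task(shape: TaskShape, card: RateCard) -> float:
    """Effective dollar cost per task, following the formula in the text."""
    cached = shape.input_tokens * shape.cache_hit_rate
    fresh = shape.input_tokens - cached
    per_call = (fresh * card.input_rate
                + cached * card.cached_rate
                + shape.output_tokens * card.output_rate) / 1_000_000
    return shape.calls_per_task * per_call


cards = [
    RateCard("model-a", input_rate=0.50, cached_rate=0.05, output_rate=15.00),
    RateCard("model-b", input_rate=1.00, cached_rate=0.25, output_rate=5.00),
]
shape = TaskShape(input_tokens=2000, output_tokens=500,
                  cache_hit_rate=0.5, calls_per_task=1.2)

by_input_rate = [c.name for c in sorted(cards, key=lambda c: c.input_rate)]
by_task_cost = [c.name for c in sorted(cards, key=lambda c: cost_per_task(shape, c))]

print("ranked by input rate:", by_input_rate)
print("ranked by cost per task:", by_task_cost)
```

For this assumed workload the two rankings come out in opposite order, which is the effect described above.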
The Pricing Dimensions to Compare
Use this table as a checklist when you pull numbers from a provider's pricing page.
| Dimension | What to look for | Why it matters |
|---|---|---|
| Input token rate ($/M) | Standard synchronous rate for prompt tokens | Baseline; dominates input-heavy tasks |
| Output token rate ($/M) | Rate for generated tokens; compare to input rate | Dominates for drafting, code gen, agents |
| Cached-input rate ($/M) | Discounted rate for repeated static prompt sections | Can cut 50–90% of input cost if you design for it |
| Cache eligibility rules | Minimum token length, TTL, per-call vs. stored | Not all prompts qualify; rules differ by provider |
| Batch / async rate ($/M) | Discounted rate for non-real-time jobs | Often ~50% off; worth it for offline pipelines |
| Context window size | Maximum tokens per call (input + output) | Larger windows cost more to fill; plan accordingly |
| Image / audio pricing | Per-tile, per-second, or token-equivalent rates | Can dominate bill for multimodal apps |
| Markup vs. list price | Is the intermediary adding a percentage on top? | BYOK gateways pass through list price; resellers may charge more |
| Rate limits | Tokens per minute, requests per minute by tier | Hitting limits forces retries or fallbacks — a hidden cost |
Where to Find Live Prices
Published prices change, sometimes quarterly and sometimes without announcement. Never base a business decision on a price quoted in a blog post (including this one). Go to the source:
- flo2's /llm-pricing page aggregates current rates across providers in one place, so you can compare models without bouncing between tabs. Because flo2 routes using your own provider keys with zero token markup, the prices shown are actual list prices — not reseller rates.
- Provider pricing pages — OpenAI, Anthropic, Google (Gemini), Mistral, Groq, Cerebras, DeepInfra, and others each publish their own rate cards. Always verify against the provider's own page before committing to a model for production.
The /llm-pricing page is particularly useful for comparing LLM pricing across input, output, and cached-input dimensions in a single table, which is exactly the format the formula above requires.
Practical Steps Before You Commit
A fast workflow that avoids sticker-shock invoices later:
- Profile your token shape first. Log 100 real or simulated calls. Compute average input and output tokens, and check whether your static prompt sections would qualify for caching.
- Score on cost per successful task, not cost per token. Run a small eval — maybe 50–100 examples — on each candidate model. Factor in failure rate and retry behavior.
- Compare batch vs. synchronous. If any of your workflows tolerate latency (evals, enrichment, reporting), price them under the batch rate.
- Check the markup layer. If you use a gateway or proxy, confirm whether it adds a percentage on top of provider rates. Unless you call the provider directly, a BYOK gateway that charges no token markup is the way to ensure you're paying list price.
- Set a cost budget per task in code. Having a number — "this classification call should cost under $0.002" — makes monitoring straightforward and catches regressions early.
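The last step might look like the following sketch, which uses the $0.002 classification budget from the example above. The function names, the budget table, and the $1/M and $5/M rates are assumptions; in practice the token counts would come from the usage data returned with each response.

```python
BUDGETS_USD = {"classification": 0.002}  # per-call budget from the example above


def call_cost_usd(input_tokens: int, output_tokens: int,
                  input_rate: float, output_rate: float) -> float:
    """Dollar cost of one call; rates are in $ per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def within_budget(task_type: str, cost_usd: float,
                  budgets: dict[str, float] = BUDGETS_USD) -> bool:
    """Return True when a call's cost is at or under the task type's budget."""
    return cost_usd <= budgets[task_type]


# Hypothetical rates: $1/M input, $5/M output.
normal_call = call_cost_usd(800, 5, 1.00, 5.00)     # short prompt
bloated_call = call_cost_usd(3000, 5, 1.00, 5.00)   # prompt grew after a change

for label, cost in [("normal", normal_call), ("bloated", bloated_call)]:
    status = "ok" if within_budget("classification", cost) else "OVER BUDGET"
    print(f"{label}: ${cost:.6f} {status}")
```

Wiring a check like this into logging or alerting is one way to catch a prompt that quietly grew past its budget.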
Bringing It Together
LLM API pricing is not a single number. It is a product of your input/output token ratio, your cache hit rate, your context length, your retry behavior, and whether the intermediary you're using adds a markup. Headline rates are useful for orientation but dangerous as a decision criterion. The comparison that matters is cost per successful task on your actual workload — which requires measuring your own token shape and running the numbers through each provider's full rate card.
flo2's pricing page gives you the current rate table across providers. flo2 itself routes your calls to the cheapest or fastest model that fits your constraints, using your own provider keys with zero token markup — meaning the list prices you see are the prices you pay. During Beta, the flo2 gateway layer itself is free.
For deeper background on unit economics and cost-per-task modeling, see the AI tokenomics guide. For a direct model-by-model cost breakdown by task type, see the cheapest LLM API comparison.