2026-05-25 · flo2 blog

The Cheapest LLM API in 2026: How to Actually Pay Less per Token

If you ship anything on top of large language models, your provider invoice is now a real line item — and it grows with every user. The good news: the gap between what you currently pay and the cheapest LLM API for your workload is usually wide, and closing it is mostly engineering, not negotiation. The catch is that there is no single cheapest API. The right answer in 2026 depends on your token mix, your context size, and how high your quality bar actually needs to be.

This guide walks through how LLM API pricing works, where the cheap tokens live in 2026, and the concrete tactics that reduce LLM costs without wrecking output quality.

Why "cheapest" depends on the task, not the provider

LLM pricing is quoted per million tokens, split into input (your prompt) and output (the model's reply). Output tokens almost always cost more than input — often two to five times more. That single fact reshapes the whole question, because two apps using "the same model" can have completely different bills.

Input-heavy, output-light (classification, extraction, routing, "yes/no" judgments): you send a lot of context and get back a few tokens. Cheap input pricing matters most.
Output-heavy (long-form drafting, code generation, agents that monologue): the reply dominates. A model with cheap input but pricey output can quietly cost more than a "more expensive" model with balanced rates.
Large-context (RAG over big documents, long chat histories): you pay for every input token on every call. A 30k-token context resent 50 times in a conversation is 1.5M input tokens — for one user session.

The other axis is the quality bar. A frontier model is overkill for "extract the invoice date from this text," but appropriate for "refactor this 400-line module." Paying frontier prices for trivial tasks is the most common way teams overspend. The cheapest LLM API for a task is the lowest-priced model that still clears the bar for that specific task — which means you'll use several models, not one.

LLM API pricing in 2026: the rough tiers

Exact numbers move constantly, so treat the following as orientation and check each provider's pricing page before you commit. That said, the market has settled into recognizable tiers by price per million tokens.

Tier	Rough $/M input	Rough $/M output	Typical fit
Frontier closed (top OpenAI / Anthropic / Gemini models)	~$2–$15+	~$8–$75+	Hard reasoning, long agentic chains, high-stakes output
Mid-tier closed ("mini"/"flash"-class)	~$0.10–$1	~$0.40–$5	Everyday production tasks, good quality-per-dollar
Open-weight on fast inference hosts (Groq, Cerebras, DeepInfra)	~$0.05–$0.90	~$0.08–$1.50	Classification, extraction, summaries, drafts, high-volume work

The headline of 2026 is that open-weight models served on specialized inference hardware are dramatically cheaper than frontier closed models — frequently an order of magnitude — and on platforms like Groq and Cerebras they're also extremely fast. For a large share of real workloads (the easy 70–80%), an open model is not a compromise; it's simply the correct, cheaper tool. Providers like Mistral, DeepInfra, and xAI further widen the menu, which is exactly why a multi-provider strategy beats betting everything on one vendor's price list.

Tactics that actually reduce LLM costs

1. Route cheap-model-first, with a quality fallback

Instead of sending everything to one premium model, try a cheap open-weight model first and escalate only when needed. The pattern: attempt the task on an inexpensive model, validate the result (schema check, confidence signal, a quick rubric), and fall back to a stronger model only on failure. Because most requests are easy, most never touch the expensive tier — yet hard cases still get frontier quality. This single change often cuts spend by half or more.

2. Cache repeated work

Two kinds of caching matter. Prompt caching (offered natively by several providers) discounts the large, static portion of your prompt — system instructions, few-shot examples, retrieved context — when it repeats across calls. Response caching returns a stored answer for an identical request, dropping the cost of that repeat call to essentially zero. If your traffic has any duplication at all — FAQ-style queries, re-runs, idempotent pipeline steps — response caching is close to free money.

3. Trim `max_tokens` and context

Since output is the expensive side, cap it. Set max_tokens to what the task genuinely needs; "let it ramble" is a line item. On the input side, prune context aggressively: retrieve fewer, better chunks for RAG, summarize or truncate long chat histories, and drop boilerplate. Asking for structured output (JSON, short fields) instead of prose cuts output tokens too.

4. Batch and pick the right model class

Many providers offer batch endpoints at a meaningful discount (often around 50%) for non-urgent work — overnight enrichment, evals, bulk summarization. And for genuinely easy tasks, deliberately choose an open-weight or "mini"-class model rather than defaulting to the flagship. Matching model class to task difficulty is the highest-leverage habit you can build.

The reseller-markup trap vs. bring-your-own-key

Many aggregators that hand you one convenient API key are reselling tokens: they buy from the provider and add a margin, or bake a spread into per-token rates. It's frictionless, but you're paying a tax on every single call, forever — and the true provider cost is hidden from you, so you can't tell whether your optimizations are working.

The bring-your-own-key (BYOK) model flips this. You hold accounts directly with OpenAI, Anthropic, Groq, and the rest, and a gateway routes through your keys. You pay providers at list price with zero markup, and you keep full visibility into what each call actually cost. For anything beyond hobby volume, BYOK is the cheaper long-run path almost every time.

A worked example

Say a support assistant handles 1,000,000 requests/month. Each request: ~800 input tokens, ~200 output tokens.

Baseline — everything on a frontier model at ~$5/M input and ~$15/M output:

Input: 800M tokens × $5/M = $4,000
Output: 200M tokens × $15/M = $3,000
Monthly total: ~$7,000

Optimized — route 80% to an open-weight model at ~$0.20/M in / $0.30/M out, escalate 20% to the frontier model, and serve 15% of all traffic from a response cache:

Cached 150k requests: ~$0 incremental.
Cheap tier (~680k requests): ~544M in × $0.20/M + ~136M out × $0.30/M ≈ $109 + $41 = $150.
Frontier tier (~170k requests): ~136M in × $5/M + ~34M out × $15/M ≈ $680 + $510 = $1,190.
Monthly total: ~$1,340.

Same product, roughly 80% lower bill — from routing, caching, and not overpaying for the easy 80%. Your real numbers will differ with your token mix, but the shape holds: the savings come from how you spend, not from finding one magic provider.

Why a gateway makes this measurable

You can hand-roll routing and caching, but you'll want to prove the savings, not hope for them. A gateway that records tokens and computed cost per call turns "we think this helped" into a number. The pieces that move the needle — assign each model its true per-token price, route cheapest-first, fall back or race for latency, cache opt-in — are exactly what a router is built to do, and exposing real cost per call is what keeps the optimization honest over time.

That's the niche flo2 fills: a developer-first, BYOK LLM gateway that gives you one OpenAI- and Anthropic-compatible key, routes each request to the cheapest model that meets your bar, and logs the true cost per call with zero markup — you pay providers directly. It's a zero-markup OpenRouter alternative, free during beta, so you can see your real LLM API costs and start cutting them today.

One key, every model — zero markup.

Bring your own provider keys. flo2 routes to the cheapest, fastest model with fallback, racing and true cost accounting — free during Beta.

Get your flo2 key →