2026-06-03 · flo2 blog

Prompt Caching: How It Cuts LLM Cost & Latency

Every time your application sends a long system prompt, a dense set of few-shot examples, or a multi-page document to an LLM, the model processes those tokens from scratch — unless your provider has already seen them. Prompt caching is the mechanism that lets providers skip that re-processing: the processed prefix of your prompt is stored inside the inference cluster, so repeated context arrives cheaper and faster. Getting prompt caching right is one of the highest-leverage cost moves available to developers building on LLMs today.

This article covers what prompt caching is, how it differs from response caching, how to structure your prompts to maximize cache hits, how each major provider implements it, and how to track cached token savings in your cost accounting.

What is prompt caching?

When a model processes your prompt, it converts each token into a set of internal representations (key-value pairs in the attention mechanism). Normally those representations are computed fresh on every call. Prompt caching stores those KV representations server-side so that when you send the same prefix again, the provider can skip the computation and inject the pre-built representations directly into the inference pipeline.
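To make the idea concrete, here is a toy sketch of prefix reuse. It is not how any provider implements its cache: the class name `ToyPrefixCache`, the hashing scheme, and the string standing in for KV state are all assumptions chosen for illustration. It only shows the accounting effect: once a prefix has been seen, only the suffix needs fresh computation.

```python
import hashlib


class ToyPrefixCache:
    """Toy model of a provider-side prefix cache (illustrative only)."""

    def __init__(self):
        # Maps a hash of the prefix tokens to a stand-in for stored KV state.
        self._store = {}

    @staticmethod
    def _key(tokens):
        joined = "\x1f".join(tokens)
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()

    def process(self, tokens, prefix_len):
        """Return (tokens_computed, cache_hit) for one request."""
        prefix = tokens[:prefix_len]
        key = self._key(prefix)
        if key in self._store:
            # Prefix already processed: only the suffix is computed fresh.
            return len(tokens) - len(prefix), True
        # First sighting: compute everything and remember the prefix.
        self._store[key] = f"kv-state-for-{len(prefix)}-tokens"
        return len(tokens), False


cache = ToyPrefixCache()
system_tokens = ["sys"] * 5
first = cache.process(system_tokens + ["a", "b"], prefix_len=5)
second = cache.process(system_tokens + ["c"], prefix_len=5)
changed = cache.process(["x"] + system_tokens[1:] + ["c"], prefix_len=5)
```

In this sketch the first call computes all seven tokens, the second computes only the one-token suffix, and the third misses because a single token inside the prefix changed.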

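The result has two effects you can measure: lower cost, because cached input tokens are billed at a reduced rate, and lower latency, because time to first token (TTFT) drops when the prefix does not have to be recomputed.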

Prompt caching is a provider-side optimization. It operates inside the model's inference stack, not at your application or gateway layer. That's the key distinction from response caching.

Prompt caching vs. response caching: what's the difference?

These two mechanisms are often confused but solve different problems.

| Property | Prompt caching | Response caching |
| --- | --- | --- |
| Where it runs | Inside the provider's inference cluster | At the gateway / proxy layer |
| What is reused | Processed prefix (KV cache) of the prompt | The complete response from a prior call |
| Works with variable requests? | Yes: only the stable prefix is cached; the dynamic suffix still runs | No: the full request must be identical to hit the cache |
| Still calls the model? | Yes: the model still generates a fresh response | No: the stored response is returned directly |
| Safe for non-deterministic outputs? | Yes | Only at temperature = 0 or for idempotent tasks |

Prompt caching is safe to use broadly — you still get a fresh generation every time, just with cheaper and faster prefix processing. Response caching is more powerful (zero token cost, near-zero latency) but requires that the same answer is acceptable every time. For a deeper look at response caching, see the LLM response caching guide. The two optimizations are complementary: use prompt caching to reduce prefix cost on live calls, and layer response caching on top for fully static request/response pairs.
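For contrast, a response cache at the gateway layer can be sketched in a few lines. This is a minimal illustration, assuming an in-memory dictionary and a caller-supplied `call_model` function; `ResponseCache`, `request_key`, and the model name `example-model` are hypothetical names, not part of any provider SDK. The point is that the key covers the whole request, so any change to the messages produces a miss.

```python
import hashlib
import json


def request_key(model, messages, temperature=0.0):
    """Hash the full request; any difference yields a different key."""
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """In-memory response cache keyed on the complete request."""

    def __init__(self):
        self._responses = {}

    def get_or_call(self, model, messages, call_model):
        """Return (response, cache_hit); call the model only on a miss."""
        key = request_key(model, messages)
        if key in self._responses:
            return self._responses[key], True
        response = call_model(model, messages)
        self._responses[key] = response
        return response, False


calls = []


def fake_call_model(model, messages):
    # Stand-in for a real provider call, so the sketch runs offline.
    calls.append(messages)
    return f"answer #{len(calls)}"


rc = ResponseCache()
question = [{"role": "user", "content": "What is prompt caching?"}]
r1 = rc.get_or_call("example-model", question, fake_call_model)
r2 = rc.get_or_call("example-model", question, fake_call_model)
r3 = rc.get_or_call(
    "example-model",
    [{"role": "user", "content": "What is response caching?"}],
    fake_call_model,
)
```

The second call returns the stored answer without touching the model, while the third, with a different user message, falls through to a fresh call.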

How to structure prompts for maximum cache hits

The core rule is simple: stable content first, dynamic content last.

Providers cache a prefix of the prompt: a contiguous block starting from the first token of the request. If your system prompt is 2,000 tokens and appears at the top of every call, those 2,000 tokens can be cached and reused across every request. If you insert a dynamic user ID or timestamp anywhere inside that prefix, the prefix stops matching at that point, and everything from the changed token onward is processed at the full rate.

A practical ordering

An anti-pattern to avoid: injecting session-specific metadata (user name, account ID, timestamp) into the system prompt. This blows the cache for every user. Move that information to a separate user message at the end of the context or into a small suffix after the static system prompt.
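The difference between the two layouts can be sketched as follows. This is an illustration under stated assumptions: `STATIC_SYSTEM_PROMPT` is placeholder text, the builder functions are hypothetical helpers, and the comparison is done on characters as a rough proxy for the token-level prefix match a provider would perform.

```python
# Placeholder static prompt; a real one would hold instructions and examples.
STATIC_SYSTEM_PROMPT = "You are a support assistant.\n" + "Policy line. " * 50


def build_messages(user_name, question):
    """Stable content first; per-user details go in the final user message."""
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        {"role": "user", "content": f"User: {user_name}\nQuestion: {question}"},
    ]


def build_messages_antipattern(user_name, question):
    """Anti-pattern: per-user metadata at the top of the system prompt."""
    return [
        {"role": "system", "content": f"Current user: {user_name}\n" + STATIC_SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]


def flatten(messages):
    """Concatenate message contents in order, as a stand-in for the prompt."""
    return "".join(m["content"] for m in messages)


def shared_prefix_chars(a, b):
    """Length of the common leading run of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


good = shared_prefix_chars(
    flatten(build_messages("alice", "q1")),
    flatten(build_messages("bob", "q2")),
)
bad = shared_prefix_chars(
    flatten(build_messages_antipattern("alice", "q1")),
    flatten(build_messages_antipattern("bob", "q2")),
)
```

With the stable-first layout, two different users share the entire system prompt as a common prefix. With the anti-pattern, the shared prefix ends at the first character of the user name, so nothing useful can be reused across users.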

Minimum cacheable length

Providers impose a minimum prefix length before caching activates — typically in the range of 1,024 tokens, though the exact threshold varies. Short system prompts may not qualify. Padding prompts artificially to hit the minimum is counterproductive; instead, check whether your use case naturally produces prefixes long enough to benefit.
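A quick pre-check along these lines can flag prefixes that are unlikely to qualify. It is only a rough sketch: the 1,024-token threshold is the typical figure mentioned above rather than a guaranteed value, and the four-characters-per-token ratio is a coarse rule of thumb for English text. Use the provider's own tokenizer or token-counting endpoint when you need exact numbers.

```python
# Assumed threshold; the real minimum varies by provider and model.
MIN_CACHEABLE_TOKENS = 1024


def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token estimate; real counts need the provider's tokenizer."""
    return int(len(text) / chars_per_token)


def likely_cacheable(prefix, min_tokens=MIN_CACHEABLE_TOKENS):
    """True if the estimated prefix length meets the assumed minimum."""
    return estimate_tokens(prefix) >= min_tokens


long_prefix_ok = likely_cacheable("a" * 4096)
short_prefix_ok = likely_cacheable("a" * 400)
```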

How each major provider implements prompt caching

The mechanism and economics differ meaningfully across providers. Always verify current rates and behavior in each provider's documentation — numbers shift as models evolve.

Anthropic (Claude)

Anthropic uses explicit cache markers. You tag specific content blocks in your request with "cache_control": {"type": "ephemeral"} to tell Claude where the cacheable prefix ends. This gives you precise control. The default cache lifetime is five minutes, and it is refreshed each time the cached prefix is read. Cache reads are billed at a fraction of the standard input rate, while cache writes carry a surcharge over it.

{
  "system": [
    {
      "type": "text",
      "text": "You are a helpful assistant with deep expertise in...",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "Summarize this document: ..."}
  ]
}

OpenAI (GPT-4o, o-series)

OpenAI applies prompt caching automatically, with no special markers needed. The API silently reuses cached prefixes when it detects a repeated context. You can observe cache hits in the response's usage object under prompt_tokens_details.cached_tokens. Cached prefixes are typically evicted after 5 to 10 minutes of inactivity, though they can persist for up to an hour during off-peak periods. The discount on cached tokens versus standard input tokens varies by model; check the pricing page for current figures.
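One way to watch this in practice is to compute the cached share of each call from the usage object. The helper below is a small sketch that works on the usage section of the parsed JSON response as a plain dictionary; `cached_fraction` is a hypothetical name, and the sample numbers are made up for illustration.

```python
def cached_fraction(usage):
    """Fraction of prompt tokens served from cache for one call."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    # The details object may be missing or null on some responses.
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    if prompt_tokens == 0:
        return 0.0
    return cached / prompt_tokens


# Made-up usage payloads shaped like the usage object described above.
warm_call = {
    "prompt_tokens": 2048,
    "completion_tokens": 100,
    "prompt_tokens_details": {"cached_tokens": 1536},
}
cold_call = {"prompt_tokens": 2048, "completion_tokens": 100}

warm_ratio = cached_fraction(warm_call)
cold_ratio = cached_fraction(cold_call)
```

Logging this ratio per call makes regressions visible: if a prompt change pushes dynamic content into the prefix, the ratio drops toward zero.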

Google (Gemini)

Gemini offers explicit context caching via a separate API resource. You create a cache object with a TTL, attach content to it, and reference it in subsequent requests. This is closer to a named, persistent resource than an automatic prefix cache. Minimum content sizes and TTL ranges differ from OpenAI and Anthropic; consult the Gemini documentation for current limits.

Other providers

Groq, DeepInfra, Fireworks, and similar inference providers vary: some pass through upstream caching, some implement their own, and some don't offer it yet. Check provider documentation and your usage response fields. The AI tokenomics guide covers how these pricing tiers stack up against output token costs across the major providers.

Tracking cached tokens in your cost accounting

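Prompt caching only helps your bill if you can see it working. Most providers surface cache information in the usage field of the API response: OpenAI reports cached prefix tokens under prompt_tokens_details.cached_tokens, and Anthropic reports cache_read_input_tokens and cache_creation_input_tokens alongside input_tokens.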

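A naive cost calculation that multiplies total prompt tokens by the standard input price will over-count your spend when caching is active. Accurate per-call cost accounting needs to apply the cached rate to cache_read_input_tokens, the cache-write surcharge rate to cache_creation_input_tokens, and the standard input rate only to the remaining uncached input tokens. Field semantics differ between providers: Anthropic's input_tokens already excludes cache reads and writes, whereas OpenAI's prompt_tokens includes the cached_tokens count, so the uncached portion has to be derived by subtraction.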
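The three-bucket calculation can be sketched like this, using Anthropic-style usage field names. The rates in `EXAMPLE_RATES` are placeholder numbers chosen for easy arithmetic, not real prices, and `Rates`, `call_cost`, and `naive_call_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Prices in USD per million tokens (placeholder values below)."""
    input_rate: float
    cache_read_rate: float
    cache_write_rate: float
    output_rate: float


def call_cost(usage, rates):
    """Price each token bucket at its own rate."""
    total = (
        usage.get("input_tokens", 0) * rates.input_rate
        + usage.get("cache_read_input_tokens", 0) * rates.cache_read_rate
        + usage.get("cache_creation_input_tokens", 0) * rates.cache_write_rate
        + usage.get("output_tokens", 0) * rates.output_rate
    )
    return total / 1_000_000


def naive_call_cost(usage, rates):
    """Price every input token at the standard rate (over-counts with caching)."""
    all_input = (
        usage.get("input_tokens", 0)
        + usage.get("cache_read_input_tokens", 0)
        + usage.get("cache_creation_input_tokens", 0)
    )
    total = all_input * rates.input_rate + usage.get("output_tokens", 0) * rates.output_rate
    return total / 1_000_000


# Placeholder rates for illustration only; check your provider's pricing page.
EXAMPLE_RATES = Rates(input_rate=3.0, cache_read_rate=0.3, cache_write_rate=3.75, output_rate=15.0)

usage = {
    "input_tokens": 500,
    "cache_read_input_tokens": 2000,
    "cache_creation_input_tokens": 0,
    "output_tokens": 100,
}
accurate = call_cost(usage, EXAMPLE_RATES)
naive = naive_call_cost(usage, EXAMPLE_RATES)
```

With these placeholder numbers the naive figure comes out at two and a half times the accurate one, which is the kind of gap that hides caching gains in a cost dashboard.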

This matters more than it sounds. At scale, a miscounting layer will make it look like your prompt caching optimizations had no effect — because the savings are hidden in a bucket that your cost model charges at the wrong rate. Getting this right requires your gateway or observability layer to be aware of all three token buckets, not just total prompt tokens and completion tokens.

This is one reason developer teams using multiple providers benefit from a unified gateway that normalizes usage fields. A gateway that understands cached input tokens specifically — and accounts for them at the correct rate in per-call cost records — gives you an accurate picture of where your spend actually goes across providers, models, and workloads.
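A normalization step of this kind might look like the sketch below. It covers only Anthropic-style and OpenAI-style usage dictionaries, and it is not a description of how flo2 or any other gateway works internally; `normalize_usage` and the output bucket names are assumptions made for illustration.

```python
def normalize_usage(provider, usage):
    """Map provider-specific usage fields onto four common token buckets."""
    if provider == "anthropic":
        # input_tokens already excludes cache reads and cache writes.
        return {
            "uncached_input": usage.get("input_tokens", 0),
            "cache_read": usage.get("cache_read_input_tokens", 0),
            "cache_write": usage.get("cache_creation_input_tokens", 0),
            "output": usage.get("output_tokens", 0),
        }
    if provider == "openai":
        # prompt_tokens includes cached tokens, so subtract them out.
        details = usage.get("prompt_tokens_details") or {}
        cached = details.get("cached_tokens", 0)
        return {
            "uncached_input": usage.get("prompt_tokens", 0) - cached,
            "cache_read": cached,
            "cache_write": 0,  # no separate write bucket is reported
            "output": usage.get("completion_tokens", 0),
        }
    raise ValueError(f"unsupported provider: {provider}")


anthropic_buckets = normalize_usage(
    "anthropic",
    {
        "input_tokens": 300,
        "cache_read_input_tokens": 2000,
        "cache_creation_input_tokens": 0,
        "output_tokens": 120,
    },
)
openai_buckets = normalize_usage(
    "openai",
    {
        "prompt_tokens": 2300,
        "completion_tokens": 120,
        "prompt_tokens_details": {"cached_tokens": 2000},
    },
)
```

Both sample calls describe the same traffic (2,000 cached tokens, 300 uncached, 120 output) and normalize to identical buckets, which is what lets one pricing function serve every provider downstream.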

The realistic savings from prompt caching

The magnitude of savings depends on two factors: how long your stable prefix is relative to the total prompt, and how frequently that prefix repeats. A few scenarios that illustrate the range:

In high-hit-rate scenarios, prompt caching savings on input tokens can reduce your total API bill by 20–50% depending on the ratio of stable prefix to dynamic content and provider pricing — verify with your actual provider pricing and call mix.
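A back-of-the-envelope estimate shows how those factors combine. The function below is a simplified sketch that ignores cache-write surcharges and assumes every input is a fraction between 0 and 1; `estimated_bill_savings` and the example numbers are hypothetical, not measurements.

```python
def estimated_bill_savings(input_share, prefix_fraction, hit_rate, cached_price_ratio):
    """Estimate the fraction of the total bill saved by prompt caching.

    input_share        -- share of the bill spent on input tokens
    prefix_fraction    -- share of input tokens that sit in the stable prefix
    hit_rate           -- share of calls where the prefix is served from cache
    cached_price_ratio -- cached-token price divided by standard input price
    """
    values = {
        "input_share": input_share,
        "prefix_fraction": prefix_fraction,
        "hit_rate": hit_rate,
        "cached_price_ratio": cached_price_ratio,
    }
    for name, value in values.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    # Savings apply only to cached prefix tokens, at the discount they receive.
    return input_share * prefix_fraction * hit_rate * (1.0 - cached_price_ratio)


# Hypothetical workload: 60% of spend is input, 80% of input is stable prefix,
# 90% of calls hit the cache, cached tokens cost 10% of the standard rate.
example_savings = estimated_bill_savings(0.6, 0.8, 0.9, 0.1)
```

For that hypothetical workload the estimate is roughly 39% of the total bill. Dropping the hit rate or shrinking the prefix share pulls the figure down quickly, which is why prompt structure matters as much as provider pricing.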

Beyond cost, the latency improvement on long contexts is often the more immediately felt benefit. Shaving hundreds of milliseconds from TTFT on every agentic step or document-grounded response adds up quickly in user-perceived performance.

Getting per-call visibility without building the accounting yourself

Measuring prompt caching savings accurately across providers that each use different field names, different rate tiers, and different caching semantics is an underdiscussed source of complexity. If you're routing across Anthropic, OpenAI, and Gemini, you need a normalization layer that knows how each provider reports cached tokens and applies the right price to each bucket on each call.

flo2 tracks cached input tokens natively in its per-call cost accounting, applying provider-correct rates to each token bucket automatically. Combined with zero markup on pass-through costs and bring-your-own provider keys, you get accurate cost data without building the accounting layer yourself — during beta, at no cost.

Related reading: LLM response caching — when to go further and skip the model call entirely. AI tokenomics — how cached input pricing fits into the full unit economics of an LLM-powered product.
